Abstract. In designed experiments and surveys, known laws or
design features provide checks on the most relevant aspects of a model
and identify the target parameters. In contrast, in most observational 
studies in the health and social sciences, the primary study data do 
not identify and may not even bound target parameters. Discrepan- 
cies between target and analogous identified parameters (biases) are 
then of paramount concern, which forces a major shift in modeling 
strategies. Conventional approaches are based on conditional testing of 
equality constraints, which correspond to implausible point-mass pri- 
ors. When these constraints are not identified by available data, how- 
ever, no such testing is possible. In response, implausible constraints 
can be relaxed into penalty functions derived from plausible prior dis- 
tributions. The resulting models can be fit within familiar full or partial 
likelihood frameworks. 

The absence of identification renders all analyses part of a sensitivity 
analysis. In this view, results from single models are merely examples 
of what might be plausibly inferred. Nonetheless, just one plausible 
inference may suffice to demonstrate inherent limitations of the data. 
Points are illustrated with misclassified data from a study of sudden 
infant death syndrome. Extensions to confounding, selection bias and 
more complex data structures are outlined. 

Key words and phrases: Bias, biostatistics, causality, epidemiology, 
measurement error, misclassification, observational studies, odds ratio, 
relative risk, risk analysis, risk assessment, selection bias, validation. 



Sander Greenland is Professor, Departments of 
Epidemiology and Statistics, University of California, 
Los Angeles, California 90095-1772, USA e-mail: 
lesdomes@ucla.edu. 

This is an electronic reprint of the original article 
published by the Institute of Mathematical Statistics in 
Statistical Science, 2009, Vol. 24, No. 2, 195-210. This 
reprint differs from the original in pagination and 
typographic detail. 



1. BACKGROUND 

1.1 Observational Epidemiologic Data Identify 
Nothing 

With few exceptions, observational data in the 
health and social sciences identify no parameter what- 
soever unless assumptions of uncertain status are 
made (Greenland, 2005a). Even so-called "nonpara- 
metric identification" depends on assumptions that 
are only working hypotheses, such as absence of un- 
controlled bias. Worse, close inspection of the ac- 
tual processes producing the data usually reveals 
far more complexity than can be fully modeled in a 
reasonable length of time. Formal analyses, no mat- 
ter how mathematically sound and elegant, never 
fully capture the uncertainty warranted for inferences
from such data, and in epidemiology and medicine
often lead to inferences that are later judged much 
farther off target than random error alone could ex- 
plain (Lawlor et al., 2004). 

Many of the issues can be seen in simple cases. 
Suppose our target parameter is the prevalence
Pr(T = 1) in a target population of a health-related
exposure indicator T, to be estimated from a sample
of N persons for whom T is measured. If A is
the number of sampled persons with T = 1, the conventional
binomial model leads to an unbiased estimator
A/N of Pr(T = 1) and many procedures for
constructing interval estimates. But N is often con- 
siderably less than the number of eligible persons for 
whom contact attempts were made, leaving uncer- 
tainty about what subset of the target was actually 
sampled. One may not even be able to assume in- 
dependent responses: Physicians may refuse to pro- 
vide their entire block of patients, or patients may 
encourage friends to participate, and these actions 
may be related to T. Thus, the binomial model for A 
is a convention adopted uncritically from simple and 
complete random surveys; there are many ways this 
model may fail, none testable from the data (A, N) 
alone. 

Next, suppose we have only an imperfect mea- 
sure X of T (e.g., a self-report of T) in the sam- 
ple. The observable variable is now the count A of 
X = 1, and binomial inference on Pr(T = 1) from 
A/N alone is unsupported even with random sam- 
pling. Yet the usual if implicit convention is to pre- 
tend that Pr(X = 1) = Pr(T = 1) and discuss the 
impact of violations intuitively (e.g., Fortes et al.,
2008). This is a poor strategy because the conven- 
tion derives from the assumption X = T, which no 
informed observer holds, and anchors subsequent in- 
tuitions to this extreme case. The problem can be 
addressed by obtaining error-free measurements of 
T, but that is often impossible, for example, when T 
represents a lifetime chemical exposure or nutrient 
intake. At best we might obtain alternate measure- 
ments of T and incorporate them into a predictive 
model for T, which would have its own nonidentified 
features. 
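The gap between Pr(X = 1) and Pr(T = 1) can be made concrete with a small sketch. It assumes a simple nondifferential error model with a sensitivity and a specificity for X; these two quantities, the function name `reported_prevalence` and all numeric values are illustrative assumptions that do not appear in the text above.

```python
def reported_prevalence(true_prev, sensitivity, specificity):
    """Pr(X = 1) implied by Pr(T = 1) under an assumed error model.

    sensitivity = Pr(X = 1 | T = 1), specificity = Pr(X = 0 | T = 0).
    """
    return sensitivity * true_prev + (1.0 - specificity) * (1.0 - true_prev)


# Hypothetical values, chosen only for illustration.
true_prev = 0.20
for se, sp in [(1.0, 1.0), (0.9, 0.9), (0.7, 0.95)]:
    px = reported_prevalence(true_prev, se, sp)
    # A/N estimates px, not true_prev; the two agree only when X = T.
    print(f"se={se:.2f} sp={sp:.2f} Pr(X=1)={px:.3f} bias={px - true_prev:+.3f}")
```

Only the perfect-measurement row returns the target prevalence. Since sensitivity and specificity are themselves not identified by (A, N), any correction inherits that nonidentification.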

1.2 Identification versus Plausibility 

To summarize the problem, conventional statis- 
tics are derived from design mechanisms (such as 
random sampling) or known physical laws that en- 
force the assumed data model; but studies based on 



passive observation of health and social phenomena
(such as studies of health-care databases) have little
or nothing in the way of such enforcement, leaving 
us no assurance that conventional statistics (even 
when nonparametric) are estimating the parameter 
of interest. Furthermore, because the actual data- 
generating process depends on latent variables re- 
lated to the target parameter, that parameter is not 
identified from the observed data. These studies thus 
suffer from a curse of nonidentification, in that iden- 
tification can be achieved only by adding constraints 
that are neither enforced by known mechanisms nor 
testable with the observed data. 

In light of this problem, many epidemiologic au- 
thors have emphasized the need to unmoor observa- 
tional data analysis from conventional anchors 
(Phillips, 2003; Lash and Fink, 2003; Maldonado,
2008). There is now a vast literature on models to 
fulfill this need, sometimes described under the gen- 
eral heading of bias analysis (Greenland, 2003a, 2005a, 
2009; Greenland and Lash, 2008). Examples include 
models for selection biases (Copas, 1999; Geneletti
et al., 2009; Scharfstein, Rotnitzky and Robins, 1999,
2003), nonignorable missingness and treatment assignment
(Kadane, 1993; Baker, 1996; Molenberghs,
Kenward and Goetghebeur, 2001; Little and Rubin,
2002; Robins, Rotnitzky and Scharfstein, 2000;
Rosenbaum, 2002; Vansteelandt et al., 2006), uncon- 
trolled or collinear confounders (Bross, 1967; Leamer, 
1974; Greenland, 2003a; Gustafson and Greenland, 
2006; McCandless et al., 2007; Yanagawa, 1984), 
measurement error (Gustafson, 2003, 2005; Green- 
land, 2009) and multiple biases (Eddy, Hasselblad 
and Shachter, 1992; Greenland, 2003a, 2005a; Moli- 
tor et al., 2008; Goubar et al., 2008; Turner et al., 
2009; Welton et al., 2009). 

Despite the profusion of literature on the topic, 
integration of core ideas of bias analysis into ba- 
sic education, software and practice has been slow. 
One obstacle may be the diversity of approaches pro- 
posed. Another may be the failure to connect them 
to familiar, established methods. Yet another obsta- 
cle may be the greater demand for contextual input 
that most require. Central to that input is the infor- 
mal but crucial concept of a plausible model. I will 
call a model plausible if it appears to neither conflict 
with accepted facts nor assume far more facts than 
are in evidence. Implausible models are then models 
rejectable a priori as either conflicting with or going 
too far beyond existing background information. 



PLAUSIBLE MODELING OF NONIDENTIFIED BIASES 






The distinction between plausible and implausible 
models is fuzzy, shifting and disputable, but many 
models will clearly be implausible within a given 
context. For example, models that assume zero exposure
measurement error (X = T above) are very
implausible in environmental, occupational and nu- 
tritional epidemiology, because no one can plausi- 
bly argue such errors are absent. In fact, most con- 
ventional data-probability models appear implausi- 
ble in epidemiologic contexts. Such models are of- 
ten rationalized as providing data summaries about 
identified parameters such as Pr(X = 1) above, but
their outputs are invariably interpreted as inferences 
about targets such as Pr(T = 1). Avoiding such mis- 
interpretations requires model expansion into the 
nonidentified dimensions that connect observables 
to targets. 

Plausibility concepts apply to models for prior 
probabilities as well as to models for data-generating 
processes. For example, consider a prior for a disease
prevalence $\pi$ that assigned $\Pr(\pi = 0.5) = 0.5$
and was uniform over the rest of the unit interval.
This prior would be implausible as an informed
opinion because no genuine epidemiologic evidence
could provide such profound support for $\pi = 0.5$ and
yet fail to distinguish among every other possibility.
Analogous criticisms apply to most applications of 
"spike-and-slab" priors in the health and social sci- 
ences. 

1.3 Outline of Paper 

The present article reviews the above points, fo- 
cusing on plausible extensions of conventional mod- 
els in order to simplify bias analysis for teaching and 
facilitate its conduct with ordinary software. It be- 
gins by outlining a likelihood-based framework for 
observational data analysis that mixes frequentist 
and Bayesian ideas (as has long been recommended, 
e.g., Box, 1980; Good, 1983). It stands in explicit opposition
to the notion that use of priors demands a
fully Bayesian framework or exact posterior compu- 
tation, even though partial priors in some form are 
essential for inferences on nonidentified parameters. 
Instead, it encourages use of partial priors as iden- 
tifying penalty functions. These functions may be 
translated into augmenting data, which aids plau- 
sibility evaluation and facilitates computation with 
familiar likelihood and estimating-equation software. 

Section 3 illustrates points with data from a large 
collaborative case-control study of sudden infant 
death syndrome (SIDS). It starts with conventional 



analyses of the data, describes a misclassification 
problem, then provides analyses using priors only 
for nonidentified parameters. Section 4 then outlines 
extensions to "validation" data, and describes how 
the misclassification model can be re-interpreted to 
handle uncontrolled confounders and selection bias. 
Throughout, the settings of concern are those in 
which the data have been collected but a "correct" 
model for their generation can never be known or 
even approximated. In these settings, we cannot even 
guarantee that inferences from the posterior will be 
superior to inferences from the prior (Neath and 
Samaniego, 1997). Thus, the importance of specific 
models and priors is de-emphasized in favor of pro- 
viding a framework for sensitivity analysis across 
plausible models and priors. This framework need 
not be all-encompassing, because often just a few 
plausible specifications can usefully illustrate the il- 
lusory nature of an apparently conclusive conven- 
tional analysis. 

2. PRIORS AND PENALTIES AS TOOLS FOR 
ENHANCING MODEL PLAUSIBILITY 

2.1 Models and Constraints 

The formalism used here is similar to that in Green- 
land (2005a) and Vansteelandt et al. (2006), tailored 
to a profile penalized-likelihood approach. Consider 
a family of models $\mathcal{G} = \{G(\mathbf{a}; \gamma, \theta) : (\gamma, \theta) \in \Gamma \times \Theta\}$
for the distribution of an observable-data structure
$\mathbf{A}$ taking values $\mathbf{a}$ in a sample space $\mathcal{A}$, with $\mathcal{G}$
satisfying any necessary regularity conditions. The inferential
target parameter will be a function $\tau = \tau(\gamma, \theta)$
of the model parameters.

$\mathcal{G}$ represents a set of constraints on an unknown
objective frequency distribution or "law" for $\mathbf{A}$. In
classical applied statistics these constraints are induced
by study design or physical laws. In contrast,
in observational health and social sciences these constraints
are largely or entirely hypothetical, which
motivates the present treatment. The separation of
the total parameter $(\gamma, \theta)$ into components $\gamma$ and
$\theta$ is intended to reflect some conceptual distinction
that drives subsequent analyses and will be clarified
below. The assumption $(\gamma, \theta) \in \Gamma \times \Theta$ (variation
independence of $\gamma$ and $\theta$) provides technical simplifications
when using partial priors (Gelfand and
Smith, 1999) and will be discussed in Section 3.7.

When $(\gamma, \theta)$ is identified, $\gamma$ could contain the parameters
considered essential to retain in the model,
whereas $\theta$ could contain parameters considered optional,
as when $\theta$ contains regression coefficients of
candidate variables for deletion. Conventional modeling
then considers only the following:

(1) Equality constraints (also known as "hard,"
"sharp" or "point" constraints) of the form $r(\theta) = c$,
where $c$ is a known constant (usually 0). This
reduces the model family to
$\{G(\mathbf{a}; \gamma, \theta) : (\gamma, \theta) \in \Gamma \times r^{-1}(c)\}$,
where $r^{-1}(c)$ is the preimage of $c$ in $\Theta$, and constrains $\tau$ to
$\{\tau(\gamma, \theta) : (\gamma, \theta) \in \Gamma \times r^{-1}(c)\}$.

(2) No constraint apart from logical bounds (e.g.,
0 and 1 for a probability): both $\gamma$ and $\theta$ are
treated as "unknown constants," which corresponds
to no constraint on the target beyond
$\tau = \tau(\gamma, \theta)$.

The choice between these extremes is usually based
on a test of the constraint $r(\theta) = c$, often derived
from the likelihood function $L(\gamma, \theta; \mathbf{a}) = G(\mathbf{a}; \gamma, \theta)$
when $\mathcal{G}$ is an exponential family, for example, by
contrasting the maximum of the deviance $-2\ln L(\gamma, \theta; \mathbf{a})$
with and without the constraint.

In the problems considered in the present paper,
options (1) and (2) are not available because $\theta$ is
not identified, in the sense that, for each $\mathbf{a} \in \mathcal{A}$, the
profile likelihood $L(\theta; \mathbf{a}) = \max_{\gamma \in \Gamma} L(\gamma, \theta; \mathbf{a})$
is constant. Thus, no test of $r(\theta) = c$ is available without
introducing other nonidentified constraints. Consider
again the misclassification example observing $A = a$
for the $X = 1$ count, with $\tau = \Pr(T = 1)$ and
$\theta = \Pr(X = 1) - \Pr(T = 1)$. Then
$L(\tau, \theta; a) \propto (\tau + \theta)^{a}(1 - \tau - \theta)^{N-a}$
with $L(\theta; a) = (a/N)^{a}(1 - a/N)^{N-a}$, a
constant; thus, we cannot test $\theta = 0$ to evaluate
use of $X$ for $T$ in inference about $\tau$. In fact, we
may reparameterize to remove $\theta$ from the likelihood:
Defining $\gamma = \Pr(X = 1)$, we obtain $\Pr(T = 1) = \tau = \gamma - \theta$
and $L(\gamma, \theta; a) \propto \gamma^{a}(1 - \gamma)^{N-a}$, a transparent
parameterization (Gustafson, 2005). This parameterization
shows that observation $a$ places no constraint
on $\theta$ and hence no constraint on the target
parameter $\tau = \gamma - \theta$. Thus, $\tau$ is not even partially
identified, despite having an identified component $\gamma$.
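The flatness of the profile likelihood in $\theta$ can be checked numerically. The sketch below evaluates the binomial kernel $(\tau + \theta)^{a}(1 - \tau - \theta)^{N-a}$ on the log scale and maximizes over a grid of $\tau$ values for several fixed $\theta$. The counts, the grid search and the function names are illustrative assumptions, and $\tau$ is restricted to [0, 1], so the check applies only to $\theta$ values for which $a/N - \theta$ stays in that range.

```python
import math


def loglik(tau, theta, a, n):
    """Binomial log-likelihood kernel with Pr(X = 1) = tau + theta."""
    p = tau + theta
    if not 0.0 < p < 1.0:
        return -math.inf
    return a * math.log(p) + (n - a) * math.log(1.0 - p)


def profile_loglik(theta, a, n, grid=20001):
    """Maximize over tau in [0, 1] on an evenly spaced grid for fixed theta."""
    return max(loglik(i / (grid - 1), theta, a, n) for i in range(grid))


a, n = 30, 100  # hypothetical counts
thetas = (-0.2, -0.1, 0.0, 0.1, 0.2)
values = [profile_loglik(th, a, n) for th in thetas]
for th, v in zip(thetas, values):
    print(f"theta={th:+.1f} profile loglik={v:.6f}")
```

Every row prints the same value, the log of $(a/N)^{a}(1 - a/N)^{N-a}$, so the data cannot discriminate among these values of $\theta$.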

2.2 Sensitivity to Bias Parameters 

Because $\theta$ determines the discrepancy between
the target $\tau$ and the identified parameter $\gamma$ often
estimated as if it were the target, $\theta$ may be called
a bias parameter (Greenland, 2005a). Because inferences
that are sensitive to nonidentified parameters
will remain asymptotically sensitive to constraints
on those parameters, $\theta$ has also been called
a sensitivity parameter (Molenberghs, Kenward and
Goetghebeur, 2001). Conventional sensitivity analysis
shows how inferences change as equality constraints
are varied, for example, as $c$ in $\theta = c$ is varied
(Rosenbaum, 2002; Greenland and Lash, 2008).

Vansteelandt et al. (2006) allow relaxation of such
point constraints into a constraint of the form $\theta \in R$,
where $R$ represents a plausible range for $\theta$. This constraint
may be written as $r(\theta) = 1$, where $r(\theta)$ is the
membership indicator for $R$. Let $\gamma_0$ be the true value
of $\gamma$, which is unknown but identified. Assuming
$\theta \in R$ then constrains $\tau$ to
$\{\tau(\gamma, \theta) : (\gamma, \theta) \in \Gamma \times R\}$
and identifies the set $\{\tau(\gamma_0, \theta) : \theta \in R\} = \tau(\gamma_0, R)$.
Vansteelandt et al. call $\tau(\gamma_0, R)$ an ignorance region
for $\tau$ and propose frequentist estimators of this region
as a means of summarizing a sensitivity analysis
in which $\theta$ is varied over $R$. To illustrate with the
misclassification example, the constraint $|\theta| < 0.2$
corresponds to $|\theta| \in R = [0, 0.2)$ and identifies the ignorance
region $\tau(\gamma_0, R) = \{\gamma_0 - \theta : |\theta| < 0.2\}$.
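For the misclassification example, the ignorance region is an interval around the identified value. The sketch below assumes the relation $\tau = \gamma - \theta$ from Section 2.1 and additionally intersects the result with the logical range [0, 1] of a probability. That intersection, the function name and the value of `gamma0` are assumptions made for illustration.

```python
def ignorance_region(gamma0, theta_bound):
    """Endpoints of {gamma0 - theta : |theta| < theta_bound}, clipped to [0, 1].

    The interval is open at any endpoint that is not clipped.
    """
    lower = max(0.0, gamma0 - theta_bound)
    upper = min(1.0, gamma0 + theta_bound)
    return lower, upper


# gamma0 = 0.30 is a hypothetical true value of Pr(X = 1).
lower, upper = ignorance_region(gamma0=0.30, theta_bound=0.2)
print(f"tau lies between {lower:.2f} and {upper:.2f}")
```

In practice $\gamma_0$ is unknown and would be replaced by an estimate, which adds sampling uncertainty to the width already imposed by $R$.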

From a Bayesian perspective, sensitivity analysis
by varying $c$ in $r(\theta) = c$ can be viewed as analyses
of sensitivity to priors in which the priors are
limited to point masses $\Pr(r(\theta) = c) = 1$. Similarly,
analyses based on $\theta \in R$ can be viewed as using a
prior restricted to $R$. Why limit analyses to such
sharply bounded constraints or priors? A practical
argument might be that there are too many possible
constraints or priors and, thus, some limit on
their form is needed. But typical equality constraints
(point priors) are very implausible, insofar as they
make assertions far beyond that warranted by available
evidence; that is, they are much too informative.
Similarly, restricting $\theta$ to a sharply bounded
region $R$ risks completely excluding values of $\theta$ that
are plausible (and perhaps correct); expansion of $R$
to avoid this risk may result in a practically uninformative
region $\tau(\gamma_0, R)$ for $\tau$. A broad region $R$
also ignores what may be substantial differences in
plausibility among its members.

2.3 Relaxation Penalties and Priors 

To address the deficiencies of point constraints
for $\theta$, we may instead relax (expand) the constraints
into a family $\mathcal{D} = \{D(\theta; \lambda) : \lambda \in \Lambda\}$ of penalty
functions indexed by $\lambda$ which subsumes the point constraints
as special or limiting cases. This situation
arises in scatterplot smoothing, where $\gamma$ may contain
an intercept, linear and quadratic term and $\theta$
may contain distinct cubic terms for each design
point, thus leaving $(\gamma, \theta)$ nonidentified. Point constraints
(such as setting all nonlinear terms to zero)
exclude entire dimensions of the regression space,
and hence risk oversmoothing. In contrast, having
no constraint gives back the raw data points as the
fitted curve, resulting in no smoothing. Penalties
provide "soft" or "fuzzy" constraints that relax the
sharp constraints of conventional models to produce
smooth curves between these extremes (Hastie and
Tibshirani, 1990).

Penalization is a form of shrinkage estimation, 
wherein asymptotic unbiasedness may be sacrificed 
in exchange for reduced expected loss. Nonetheless, 
in identified models mild penalties can also reduce 
asymptotic bias relative to ordinary maximum like- 
lihood (Bull, Lewinger and Lee, 2007). In observa- 
tional studies the potential gain from penalization is 
far greater because unbiasedness can be derived only 
by assuming greatly oversimplified models that are 
likely false (Greenland, 2005a). In particular, an estimator
unbiased under a model $G(\mathbf{a}; \gamma, c)$ may suffer
enormous bias if the point constraint $\theta = c$ is incorrect.
Penalties that relax $\theta = c$ to a weaker form
can reduce this source of bias (Greenland, 2000; 
Gustafson and Greenland, 2006, 2010), although un- 
biasedness is arguably an unrealistic goal in these 
settings. 

Given $\mathcal{D}$, popular strategies for choosing $\lambda$ include
empirical-Bayes and cross-validation (Hastie, Tibshirani
and Friedman, 2001). In the present applications,
however, $\lambda$ is not identified and so external
grounds for choosing $\lambda$ are needed. The interpretation
of a penalty function $D(\theta; \lambda)$ as the transform
$-2\ln\{H(\theta; \lambda)\}$ of a prior density $H(\theta; \lambda)$ can provide
contextual guidance for making good choices.
To illustrate, let $\lambda = (\mu, W)$ with $W$ a positive-definite
information matrix. Then the quadratic (generalized
ridge-regression) penalty $(\theta - \mu)'W(\theta - \mu)$
corresponds to a normal$(\mu, W^{-1})$ prior on $\theta$ (Titterington,
1985). For diagonal $W$ the absolute (Lasso)
penalty $|\theta - \mu|'w^{1/2}$ corresponds to independent
double-exponential (Laplacian) priors with mean and
scale vectors $\mu$ and $w^{-1/2}$ where $w$ is the diagonal
of $W$ (Tibshirani, 1996).
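The correspondence between a quadratic penalty and a normal prior can be verified numerically in the scalar case. The sketch below compares the penalty $w(\theta - \mu)^2$ with $-2\ln$ of a normal$(\mu, 1/w)$ density; the two should differ only by a constant that does not depend on $\theta$. The function names and the values of `mu` and `w` are illustrative assumptions.

```python
import math


def quadratic_penalty(theta, mu, w):
    """Scalar ridge-type penalty w * (theta - mu)^2."""
    return w * (theta - mu) ** 2


def neg2_log_normal_density(theta, mu, w):
    """-2 ln of the normal(mu, 1/w) density evaluated at theta."""
    var = 1.0 / w
    dens = math.exp(-0.5 * (theta - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return -2.0 * math.log(dens)


mu, w = 0.0, 2.0  # hypothetical prior mean and information
diffs = [
    neg2_log_normal_density(t, mu, w) - quadratic_penalty(t, mu, w)
    for t in (-1.0, 0.0, 0.5, 2.0)
]
print(diffs)  # the same constant, ln(2*pi/w), at every theta
```

Because the difference is free of $\theta$, maximizing the penalized likelihood and maximizing the posterior under this prior select the same point.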

Taking $\mu = c$, in either case the point constraint
$\theta = c$ is now the limiting penalty as $W^{-1}$ goes to
0, and thus corresponds to infinite information. We
should instead want to choose $W$ such that the resulting
$H(\theta; \lambda)$ is no more informative than we find
plausible and assigns more than negligible odds (relative
to the maximum) to all plausible possibilities.
The form of the resulting penalty will allow varying
degrees of plausibility over $\Theta$, as it should; $c$ becomes
the most plausible value. This use of priors
to relax sharp constraints on nonidentified param- 
eters does not entail commitment to Bayesian phi- 
losophy, since the resulting penalized estimators can 
still be evaluated based in part on their frequency 
properties (Gustafson, 2005; Gustafson and Green- 
land, 2006, 2010). 

In general, interpretation of a given penalty $D(\theta; \lambda)$
involves transformation to see which if any $\lambda$ yield
contextually reasonable prior densities
$H(\theta; \lambda) \propto \exp(-D(\theta; \lambda)/2)$. If $\exp(-D(\theta; \lambda)/2)$ is not
integrable, $D(\theta; \lambda)$ will not completely identify $\theta$ (the
implied prior is improper), although $\tau$ or some lower-dimensional
function of it may still have a proper
prior (Gelfand and Sahu, 1999). One may also vary
$\lambda$ to assess sensitivity to its choice, or give $\lambda$ a prior
as in Bayes empirical-Bayes estimation (Deely and
Lindley, 1981; Samaniego and Neath, 1996).

2.4 Partial Priors 

For a pragmatic frequentist who uses priors to 
impose penalties or soft constraints, use of a prior 
for $\theta$ alone is natural. For a pragmatic Bayesian,
placing a prior on $\theta$ alone is an effort-conserving
strategy, recognizing that thorough exploration of 
all priors is infeasible and that the dimensions de- 
manding greatest care in prior specification are the 
nonidentified ones (Neath and Samaniego, 1997). In 
contrast, some or all identified dimensions may be 
judged not worth the effort of formalizing, especially 
if the data have enough information in those dimen- 
sions to overwhelm any cautious or vague prior. 

With prior specification limited to $\theta$, the implicit
prior $p(\gamma, \theta) = p(\theta)$ is improper on $\Gamma$, with posterior
$p(\gamma, \theta | \mathbf{a}) \propto L(\gamma, \theta; \mathbf{a})H(\theta; \lambda)$. The log posterior
is then a loglikelihood for $(\gamma, \theta)$ with penalty
$-2\ln\{H(\theta; \lambda)\}$. The resulting penalized-likelihood
analyses have been called partial-Bayes or semi-Bayes
(Cox, 1975; Greenland, 1992; Bedrick, Christensen
and Johnson, 1996); they can also be applied if
$L(\gamma, \theta; \mathbf{a})$ is a partial, pseudo or weighted likelihood.
Conventional analyses are extreme cases in which $\theta$ is
given a point prior. Identification of $(\gamma, \theta)$ by
$L(\gamma, \theta; \mathbf{a})H(\theta; \lambda)$ corresponds to a unique maximum
penalized-likelihood estimate (MPLE) and a proper
posterior distribution for $(\gamma, \theta)$.

2.5 Plausible Penalties and Data Priors 

Not all penalties or priors will appear plausible. 
One way to evaluate plausibility is to construct a 
thought experiment with sample space $\mathcal{B}$ such that
$H(\theta; \lambda)$ is the profile likelihood $L(\theta; \mathbf{b}_\lambda, \lambda)$ for $\theta$
derived from an outcome $\mathbf{b}_\lambda \in \mathcal{B}$. Specifically, we examine
a family $\mathcal{F} = \{F(\mathbf{b}; \theta, \lambda, \delta) : (\theta, \delta) \in \Theta \times \Delta\}$
of distributions conjugate to the prior-distribution
family $\mathcal{H}$ in that the "data prior" $\mathbf{b}_\lambda$ yields
$L(\theta; \mathbf{b}_\lambda, \lambda) = \max_{\delta \in \Delta} F(\mathbf{b}_\lambda; \theta, \lambda, \delta) = H(\theta; \lambda)$
(Higgins and Spiegelhalter, 2002; Greenland, 2003b, 2007a, 2007b,
2009); $\delta$ contains any nuisance parameters in the
chosen experiment. The experiment and its parameterization
are chosen to make $\delta$ variation independent
of $\gamma$; there may, however, be no need for $\delta$ and so it
will henceforth be dropped.

To illustrate, consider again the binomial-survey
example reparameterized to $\gamma = \mathrm{logit}\{\Pr(X = 1)\}$
and $\mathrm{logit}\{\Pr(T = 1)\} = \gamma - \theta$, so that $\theta$ now represents
the asymptotic bias in $\mathrm{logit}(A/N)$ as an estimator
of $\mathrm{logit}\{\Pr(T = 1)\}$. A convenient prior family
for $\theta$ is the generalized-conjugate or log-F distribution
(Greenland, 2003a, 2003b, 2003c; Jones,
2004), which has density $H(\theta; \lambda) \propto e^{nrz}/(1 + e^{z})^{n}$
where $z = \{\theta + \mathrm{logit}(r) - m\}/s$ and $\lambda = [m, s, r, n]'$;
$m$ and $s$ are the desired mode and scale for the
prior, $0 < r < 1$ controls skewness, and $n > 0$ controls
tail weight (thinner tails as $n$ increases). When
$r = 0.5$ this $H(\theta; \lambda)$ is symmetric; it then equals the
logistic density when $n = 2$, and rapidly approaches
normality as $n$ increases. It also equals a likelihood
$F(b_\lambda; \theta, \lambda) \propto L(\theta; b_\lambda, \lambda) = e^{nrz}/(1 + e^{z})^{n}$ from a single
binomial observation of $b_\lambda = nr$ successes on $n$
trials when the success probability is $e^{z}/(1 + e^{z})$;
thus, our prior-generating experiment is a draw of
$b_\lambda$ from the binomial $F(b; \theta, \lambda)$.
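The two identities stated for the symmetric log-F prior can be checked directly. The sketch below codes the kernel $e^{nrz}/(1 + e^{z})^{n}$ as printed above and compares it with the logistic density (for $r = 0.5$, $n = 2$) and with a binomial likelihood kernel for $nr$ successes on $n$ trials. It uses $r = 0.5$ throughout, so that $\mathrm{logit}(r) = 0$; the function names and the evaluation points are illustrative assumptions.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def logf_kernel(theta, m, s, r, n):
    """Unnormalized log-F prior kernel e^{n r z} / (1 + e^z)^n."""
    z = (theta + logit(r) - m) / s
    return math.exp(n * r * z) / (1.0 + math.exp(z)) ** n


def binomial_kernel(successes, trials, z):
    """Binomial likelihood kernel with success probability e^z / (1 + e^z)."""
    p = math.exp(z) / (1.0 + math.exp(z))
    return p ** successes * (1.0 - p) ** (trials - successes)


def logistic_density(x):
    return math.exp(x) / (1.0 + math.exp(x)) ** 2


for theta in (-2.0, 0.0, 1.5):
    # r = 0.5, n = 2, m = 0, s = 1: the kernel is the standard logistic density.
    print(theta, logf_kernel(theta, 0.0, 1.0, 0.5, 2), logistic_density(theta))
    # r = 0.5, n = 8: same as a "data prior" of 4 successes on 8 trials.
    print(theta, logf_kernel(theta, 0.0, 1.0, 0.5, 8), binomial_kernel(4, 8, theta))
```

Reading the prior as 4 successes on 8 trials makes its information content explicit: it is worth about as much as eight Bernoulli observations on the bias parameter.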

The representation $(\mathcal{B}, \mathcal{F}, \mathbf{b}_\lambda)$ will not be unique,
reflecting that different experiments may yield the
same likelihood function. This is no problem; in fact,
alternate representations can help gauge the knowledge
claims implicit in the prior $H(\theta; \lambda)$. Translating
priors into a likelihood of the form
$L(\theta; \mathbf{b}_\lambda, \lambda) = F(\mathbf{b}_\lambda; \theta, \lambda)$ provides a measure of information in
$H(\theta; \lambda)$ that can be appreciated in terms of effective
sample size (above, the total $n$ in $\mathbf{b}_\lambda$) and other
practical features that would produce this information.
The exercise thus helps judge whether $H(\theta; \lambda)$
is implausibly informative (Greenland, 2006). It also
allows sensitivity analysis based on varying $\mathbf{b}_\lambda$, which
may be more intuitive than analyses based on the
original parameters in $\lambda$.

Note that conjugacy of $\mathcal{H}$ with the actual data
model $\mathcal{G}$ is not required: $\mathcal{G}$ and $\mathcal{F}$ may be different
distributional families, so that the resulting actual
likelihood $L(\gamma, \theta; \mathbf{a})$ and the "prior likelihood"
$L(\theta; \mathbf{b}_\lambda, \lambda)$ may be different functional forms related
only through $\theta$.

2.6 Computation 

The penalized loglikelihood
$\ln\{L(\gamma, \theta; \mathbf{a})H(\theta; \lambda)\} = \ln\{L(\gamma, \theta; \mathbf{a})L(\theta; \mathbf{b}_\lambda, \lambda)\}$
can be summarized
in the usual way, with the maximum and negative
Hessian (observed penalized information) supplying
approximate posterior means and standard deviations
(Leonard and Hsu, 1999). When $\ln\{H(\theta; \lambda)\}$
itself decomposes into a sum of "prior likelihood"
components, as when $\mathbf{b}_\lambda$ is a vector of independent
prior observations, the result is typically a posterior
more normal than $L(\gamma, \theta; \mathbf{a})$. This improves
the numerical accuracy of posterior tail-area approximations
based on the profile-penalized likelihood
(Leonard and Hsu, 1999), which can be remarkably
close to exact tail areas even with highly skewed
distributions (Greenland, 2003b). If $\mathcal{H}$ and $\mathcal{G}$ are
conjugate, approximate Bayesian inferences can be
obtained simply by appending the prior data $\mathbf{b}_\lambda$
to the actual data $\mathbf{a}$ and entering the augmented
data set into ordinary maximum-likelihood software
along with appropriate offsets (Greenland, 2003b,
2007a, 2007b, 2009).

Nonetheless, posterior simulation is widely pre- 
ferred for Bayesian analyses. Markov chain samplers 
sometimes incur burdens due to autocorrelation and 
convergence failure, especially when dealing with nonidentified
models and improper priors. Under a transparent
parameterization with $p(\gamma, \theta | \mathbf{a}) = p(\gamma | \mathbf{a})p(\theta | \gamma)$
and $G(\mathbf{a}; \gamma, \theta) = G(\mathbf{a}; \gamma)$, we can instead make independent
draws from $p(\gamma, \theta | \mathbf{a})$ if we can make independent
draws $\gamma^*$ from $p(\gamma | \mathbf{a})$, then draw from
$p(\theta | \gamma^*)$. In the application below, $p(\gamma)$ is constant
or conjugate with $G(\mathbf{a}; \gamma)$, hence, $p(\gamma | \mathbf{a})$ is conjugate
and easy to independently sample when $G(\mathbf{a}; \gamma)$
is a conventional count model. Additionally, with a
partial prior $p(\gamma, \theta) = p(\theta)$, or more generally with
$p(\gamma, \theta) = p(\gamma)p(\theta)$, drawing from $p(\theta | \gamma)$ reduces to
drawing from $p(\theta) = H(\theta; \lambda)$.
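The two-step sampler can be sketched for the prevalence example on the probability scale, where $\gamma = \Pr(X = 1)$ and $\tau = \gamma - \theta$. With a constant prior on $\gamma$ and a binomial count, $p(\gamma | a)$ is Beta$(a + 1, N - a + 1)$. The normal prior for $\theta$, the counts and the function names below are hypothetical choices for illustration only; draws of $\tau$ are not truncated to [0, 1] in this sketch.

```python
import random
import statistics


def draw_target(a, n, theta_sampler, draws=20000, seed=2009):
    """Independent draws of tau = gamma - theta from the joint posterior."""
    rng = random.Random(seed)
    taus = []
    for _ in range(draws):
        # Step 1: gamma* from p(gamma | a), a Beta posterior under a flat prior.
        gamma_star = rng.betavariate(a + 1, n - a + 1)
        # Step 2: theta from p(theta | gamma*) = p(theta), the partial prior.
        theta_star = theta_sampler(rng)
        taus.append(gamma_star - theta_star)
    return taus


def example_theta_prior(rng):
    """Hypothetical bias prior: normal with mean 0 and SD 0.05."""
    return rng.gauss(0.0, 0.05)


taus = draw_target(a=30, n=100, theta_sampler=example_theta_prior)
print(statistics.mean(taus), statistics.stdev(taus))
```

Because the draws are independent, no burn-in or convergence diagnostics are needed, and the spread of $\tau$ does not shrink to zero as $N$ grows: the prior variance of $\theta$ remains.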

3. ANALYSES OF A SIDS STUDY 

3.1 Conventional Analyses 

Table 1 presents the relation of maternal antibiotic
report during pregnancy (X) to SIDS occurrence
(Y) (Kraus, Greenland and Bulterys, 1989).
Given the rarity of SIDS, the underlying population
risk ratio comparing the exposed to the unexposed
(X = 1 vs. X = 0) is well approximated by
the corresponding odds ratio $OR_{XY}$. Thus, we may
take this odds ratio or $\beta = \ln(OR_{XY})$ as the target
parameter. The usual maximum-likelihood estimate
(MLE) of $OR_{XY}$ is the sample odds ratio
$173(663)/134(602) = 1.42$, with standard error for
the log odds ratio estimate $\hat{\beta}$ of
$(1/173 + 1/602 + 1/134 + 1/663)^{1/2} = 0.128$
and 95% confidence limits (CL) for $OR_{XY}$
of $\exp\{\ln(1.42) \pm 1.96 \cdot 0.128\} = 1.11, 1.83$. Absent
major concerns about bias, such results would commonly
be interpreted as providing an inference that
$OR_{XY}$ is above 1 but below 2.
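The conventional estimate and its Wald limits follow from the four counts in Table 1. The short sketch below reproduces that arithmetic; the function name and argument names are illustrative, not from the paper.

```python
import math


def odds_ratio_wald(case_exp, case_unexp, ctrl_exp, ctrl_unexp, z=1.96):
    """Sample odds ratio, SE of its log, and Wald confidence limits."""
    odds_ratio = (case_exp * ctrl_unexp) / (ctrl_exp * case_unexp)
    se_log = math.sqrt(1 / case_exp + 1 / case_unexp + 1 / ctrl_exp + 1 / ctrl_unexp)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, se_log, lower, upper


# Counts from Table 1: cases (Y = 1) and controls (Y = 0) by reported use X.
or_xy, se_xy, lower_xy, upper_xy = odds_ratio_wald(173, 602, 134, 663)
print(f"OR={or_xy:.2f} SE={se_xy:.3f} 95% CL={lower_xy:.2f}, {upper_xy:.2f}")
```

These limits reflect random error only; they are computed as if reported use X were the true exposure.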

Consider next a prior for the odds ratio. At the 
time of the study only weak speculations could be 
made. Not even a direction could be asserted: An- 
tibiotics might be associated with elevated risk (mark- 
ing effects on the fetus of an infection, or via direct 
effects) or with reduced risk (by reducing presence of 
infectious agents). Nonetheless, by the time of the 
SIDS study, US antibiotic prevalence had climbed 
to 20% over the preceding four decades and yet the 
SIDS rate remained a fraction of a percent. This 
high exposure prevalence and the prominence of the 
outcome effectively ruled out odds ratios on the or- 
der of 5 or more because those would have gener- 
ated notably higher background SIDS risk in earlier 
studies and surveillance. Thus, one plausible starting
prior would have placed 2 : 1 odds on $OR_{XY}$
between 1/2 and 2, and 95% probability on $OR_{XY}$
between 1/4 and 4. These initial bets follow from a
normal$(\mu, \sigma^2)$ prior for $\beta$ that satisfies
$\exp(\mu \pm 1.96 \cdot \sigma) = 1/4, 4$. Solving, we get
$E(\beta) = \mu = 0$, $\sigma^2 = 1/2$, for
a penalty of $(\beta - \mu)^2/\sigma^2 = 2\beta^2$, hence subtraction of
$\beta^2$ from the loglikelihood.
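The prior parameters can be recovered from the stated 95% limits, and the 2 : 1 bet can be checked against them. This sketch solves for the mean and standard deviation of the normal prior on the log odds ratio and then evaluates the prior probability that the odds ratio lies between 1/2 and 2; the helper names are illustrative assumptions.

```python
import math


def normal_prior_from_limits(lower, upper, z=1.96):
    """Mean and SD of a normal prior on ln(OR) with the given 95% OR limits."""
    mu = (math.log(lower) + math.log(upper)) / 2.0
    sigma = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return mu, sigma


def prior_probability(lower, upper, mu, sigma):
    """Normal prior probability that the OR falls between lower and upper."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return cdf(math.log(upper)) - cdf(math.log(lower))


mu, sigma = normal_prior_from_limits(0.25, 4.0)
p_half_to_two = prior_probability(0.5, 2.0, mu, sigma)
print(f"mu={mu:.3f} variance={sigma ** 2:.3f} Pr(1/2 < OR < 2)={p_half_to_two:.3f}")
```

The variance comes out close to 1/2 and the central probability close to 2/3, which corresponds to odds of about 2 : 1.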

Let $\mathbf{a}_{XY} = (663, 134, 602, 173)'$ be the vector of
counts from Table 1. Without further prior specification,
the maximum penalized-likelihood estimate
and posterior mode of $\beta$ is 0.341, so $p(\beta | \mathbf{a}_{XY})$ is
approximately normal with mean $E(\beta | \mathbf{a}_{XY}) = 0.341$
and standard deviation 0.126. These yield an approximate
posterior median for $OR_{XY} = e^{\beta}$ of
$\exp(0.341) = 1.41$ with Wald-type 95% posterior limits
of $\exp(0.341 \pm 1.96 \cdot 0.126) = 1.10, 1.80$. These results
are barely distinct from the conventional results
because the conventional likelihood dominates
the prior.

Table 1

Data from case-control study of SIDS (Kraus, Greenland and
Bulterys, 1989). X indicates maternal recall of antibiotic
use during pregnancy and Y indicates SIDS
(Y = 1 for cases, Y = 0 for controls)

           X = 1    X = 0
Y = 1       173      602
Y = 0       134      663
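The semi-Bayes figures can be approximated by information weighting, which treats both the likelihood and the prior as normal on the log odds ratio scale. This is a normal approximation rather than an exact maximization of the penalized likelihood, and the function name is an illustrative assumption.

```python
import math


def normal_posterior(beta_hat, se, prior_mean, prior_var):
    """Precision-weighted combination of a normal likelihood and normal prior."""
    w_data = 1.0 / se ** 2
    w_prior = 1.0 / prior_var
    post_var = 1.0 / (w_data + w_prior)
    post_mean = post_var * (w_data * beta_hat + w_prior * prior_mean)
    return post_mean, math.sqrt(post_var)


# Conventional log odds ratio and standard error from the Table 1 counts.
beta_hat = math.log((173 * 663) / (134 * 602))
se = math.sqrt(1 / 173 + 1 / 602 + 1 / 134 + 1 / 663)

# Normal(0, 1/2) prior on the log odds ratio, as described in the text.
post_mean, post_sd = normal_posterior(beta_hat, se, prior_mean=0.0, prior_var=0.5)
post_lower = math.exp(post_mean - 1.96 * post_sd)
post_upper = math.exp(post_mean + 1.96 * post_sd)
print(f"mean={post_mean:.3f} SD={post_sd:.3f} "
      f"OR={math.exp(post_mean):.2f} limits={post_lower:.2f}, {post_upper:.2f}")
```

The data information (about 61 on this scale) dwarfs the prior information of 2, which is why the result barely moves from the conventional estimate.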

3.2 Model Expansion to Accommodate 
Misclassification 

Although the above antibiotic-SIDS prior makes 
little difference using the conventional likelihood, it 
makes a profound difference when we expand the 
likelihood to allow for misclassification. X repre- 
sents only mother's report of antibiotic use. Let T be 
the indicator of actual (true) antibiotic use. There 
is no doubt that mistaken reports ($T \ne X$) occur.
Moreover, recall bias seems likely, with false positives
more frequent among cases and false negatives
more frequent among controls (more $T < X$ when
$Y = 1$, more $T > X$ when $Y = 0$).

Let $A_{txy}$ be the unobserved count variable at $T = t$,
$X = x$, $Y = y$, $E_{txy} = E(A_{txy})$ and let a "+" subscript
indicate summation over the subscript.
$\mathbf{A}_{XY} = (A_{+00}, A_{+10}, A_{+01}, A_{+11})'$ is now the vector of marginal
XY count variables with
$\mathbf{E}_{XY} = E(\mathbf{A}_{XY}) = (E_{+00}, E_{+10}, E_{+01}, E_{+11})'$.
The problem can then be
restated as follows: We observe only the XY margin
$\mathbf{A}_{XY}$ and get an estimator $A_{+11}A_{+00}/A_{+10}A_{+01}$
of the marginal XY odds ratio $OR_{XY} = E_{+11}E_{+00}/E_{+10}E_{+01}$.
But the odds ratio of substantive interest
(i.e., the real target parameter $\tau$) is the marginal
TY odds ratio $\tau = OR_{TY} = E_{1+1}E_{0+0}/E_{1+0}E_{0+1}$.

With no measurement of T, data on T are missing
for everyone (T is latent) and $OR_{TY}$ is not identified
or even bounded by $\mathbf{A}_{XY}$. To estimate $OR_{TY}$,
we need information linking T to X and Y, such as
prior distributions, subjects with data on T as well
as X and Y, or both. Examples include information
on predicting T from XY, that is, information on
the predictive values $\pi_{txy} = \Pr(T = t | X = x, Y = y)$.
Because $\pi_{0xy} = 1 - \pi_{1xy}$, there are only four distinct
classification parameters, which may be taken
as $\pi_T = (\pi_{111}, \pi_{110}, \pi_{101}, \pi_{100})'$. Knowing $\pi_T$ would
allow us to impute T in the data, as shown in Table
2. Unfortunately, the XY data in Table 1 say nothing
about $\pi_T$, that is, $\pi_T$ is not identified by those
data. One must impose supplementary constraints
to say anything about the target $OR_{TY}$ based on
the XY data.

Table 2

Imputed complete-data table from SIDS study. T indicates actual
antibiotic use during pregnancy

                          X = 1                               X = 0
                Y = 1             Y = 0             Y = 1             Y = 0
    T = 1    $173\pi_{111}$   $134\pi_{110}$   $602\pi_{101}$   $663\pi_{100}$
    T = 0    $173\pi_{011}$   $134\pi_{010}$   $602\pi_{001}$   $663\pi_{000}$
    Totals       173               134               602               663

Despite these problems, many epidemiologists anchor inferences for the
target parameter tightly around uncorrected estimates. Here, the
$OR_{XY}$ estimate (1.42, 95% limits 1.11, 1.83) is exactly what one
gets for $OR_{TY}$ by assuming X = T (no error in X). It is also the
answer from a semi-Bayes analysis using a degenerate
(single-point-mass) prior for $\boldsymbol{\pi}_T$ that assigns
$\Pr(\pi_{11y} = 1) = \Pr(\pi_{10y} = 0) = 1$, an extreme prior which
no one holds. In other words, basing inference on the conventional
results relies on a highly implausible equality constraint; it takes no
account of the actual uncertainty or prior information about
$\boldsymbol{\pi}_T$, which is vague but at least bounds the
$\pi_{1xy}$ away from 0 and 1. The same criticism applies to the
conventional Bayesian result (1.41, 95% limits 1.10, 1.80), which is
based on the same equality constraint for $\boldsymbol{\pi}_T$.
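
The imputation in Table 2 and the role of the degenerate prior can be
sketched in a few lines. This is only an illustration: the function
name and the dictionary layout are assumptions made here, and the
"illustrative" predictive values (0.6 and 0.1) are demonstration inputs
rather than estimates from the study.

```python
def or_ty(xy_counts, pi1):
    """Marginal TY odds ratio after imputing T from predictive values.

    xy_counts[(x, y)] is the XY count; pi1[(x, y)] = Pr(T=1 | X=x, Y=y).
    """
    e = {}
    for t in (0, 1):
        for y in (0, 1):
            # Sum the imputed counts over x, as in the rows of Table 2.
            e[(t, y)] = sum(
                xy_counts[(x, y)] * (pi1[(x, y)] if t == 1 else 1 - pi1[(x, y)])
                for x in (0, 1)
            )
    return e[(1, 1)] * e[(0, 0)] / (e[(1, 0)] * e[(0, 1)])

TABLE1 = {(1, 1): 173, (1, 0): 134, (0, 1): 602, (0, 0): 663}

# Degenerate prior: X = T, so pi_11y = 1 and pi_10y = 0.
no_error = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.0, (0, 0): 0.0}
# Hypothetical nondifferential predictive values, for illustration only.
illustrative = {(1, 1): 0.6, (1, 0): 0.6, (0, 1): 0.1, (0, 0): 0.1}

or_if_no_error = or_ty(TABLE1, no_error)
or_illustrative = or_ty(TABLE1, illustrative)
```

With the degenerate values the function returns the uncorrected XY odds
ratio; any other choice of predictive values moves the answer, and the
XY data alone cannot say which choice is right.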

3.3 Loglinear Parameterization 

Because nonidentification makes inferences arbi- 
trarily sensitive to the prior, it is essential to con- 
sider parameterizations with simple contextual mean- 
ings so that sensible priors can be posited. The set 
of expected counts $E_{txy}$ could be taken as a saturated
parameterization for the joint distribution of the $A_{txy}$. One
reparameterization that facilitates both prior specification and use of
conventional software is

$$E_{txy}(\boldsymbol{\beta}) = \exp(\beta_0 + \beta_T t + \beta_X x + \beta_Y y + \beta_{TX} tx + \beta_{TY} ty + \beta_{XY} xy + \beta_{TXY} txy), \tag{1}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_X, \beta_Y, \beta_{XY}, \beta_T, \beta_{TX}, \beta_{TY}, \beta_{TXY})'$.
Dependence of $E_{txy}$ on $\boldsymbol{\beta}$ will be left implicit
below. The $\pi_{1xy}$ follow a saturated logistic model for the
regression of T on X and Y:

$$\pi_{1xy} = \Pr(T = 1 \mid X = x, Y = y) = \mathrm{expit}(\beta_T + \beta_{TX} x + \beta_{TY} y + \beta_{TXY} xy), \tag{2}$$

where $\mathrm{expit}(u) = e^u/(1 + e^u)$. $\boldsymbol{\pi}_T$ is a 1-1
function of the parameter subvector
$\boldsymbol{\beta}_T = (\beta_T, \beta_{TX}, \beta_{TY}, \beta_{TXY})'$
of coefficients in the imputation model for the missing T data. In the
earlier general notation, $\theta = \boldsymbol{\beta}_T$.

The TY odds ratio when X = 0 is $\exp(\beta_{TY})$ and is related to
the target $OR_{TY}$ by

$$R(\boldsymbol{\beta}_X) = OR_{TY}/\exp(\beta_{TY}) = \frac{\{1 + \exp(\beta_X + \beta_{TX} + \beta_{XY} + \beta_{TXY})\}\{1 + \exp(\beta_X)\}}{\{1 + \exp(\beta_X + \beta_{TX})\}\{1 + \exp(\beta_X + \beta_{XY})\}},$$

where $\boldsymbol{\beta}_X = (\beta_X, \beta_{TX}, \beta_{XY}, \beta_{TXY})'$.
The latter expression is a factor for a problem in which X is a
confounder rather than a measurement of T (Yanagawa, 1984); it can also
be used to represent selection bias (see below). Here it is useful for
deriving the prior for $\beta_{TY}$ from priors or constraints on
$\boldsymbol{\beta}_X$ and $OR_{TY}$. For example, ascertainment of X
before Y occurs may lead to $X \perp\!\!\!\perp Y \mid T$
(nondifferential misclassification), which is equivalent to
$\beta_{XY} = \beta_{TXY} = 0$; in that case
$R(\boldsymbol{\beta}_X) = 1$ and $\exp(\beta_{TY}) = OR_{TY}$, so that
the priors on $\exp(\beta_{TY})$ and $OR_{TY}$ must be identical.
Nondifferentiality can be relaxed by using priors centered at zero for
$\beta_{XY}$ and $\beta_{TXY}$.
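
The relation between the X = 0 stratum odds ratio and the target can
be verified numerically. The sketch below builds expected counts from
the loglinear model (1), collapses over X, and compares the result with
exp(beta_TY) times the factor R. The coefficient values are arbitrary
test inputs and the function names are assumptions of this sketch.

```python
import math
from itertools import product

def expected_counts(b):
    """Expected counts E_txy under the saturated loglinear model (1)."""
    return {
        (t, x, y): math.exp(
            b["0"] + b["T"] * t + b["X"] * x + b["Y"] * y
            + b["TX"] * t * x + b["TY"] * t * y + b["XY"] * x * y
            + b["TXY"] * t * x * y
        )
        for t, x, y in product((0, 1), repeat=3)
    }

def marginal_or_ty(e):
    """Collapse E_txy over x and form the marginal TY odds ratio."""
    m = {(t, y): e[(t, 0, y)] + e[(t, 1, y)] for t in (0, 1) for y in (0, 1)}
    return m[(1, 1)] * m[(0, 0)] / (m[(1, 0)] * m[(0, 1)])

def bias_factor(bx, btx, bxy, btxy):
    """R(beta_X) = OR_TY / exp(beta_TY)."""
    num = (1 + math.exp(bx + btx + bxy + btxy)) * (1 + math.exp(bx))
    den = (1 + math.exp(bx + btx)) * (1 + math.exp(bx + bxy))
    return num / den

# Arbitrary coefficients, chosen only to exercise the identity.
beta = {"0": 4.0, "T": -2.0, "X": -1.5, "Y": 0.1,
        "TX": 2.5, "TY": 0.3, "XY": 0.2, "TXY": -0.1}
direct = marginal_or_ty(expected_counts(beta))
via_factor = math.exp(beta["TY"]) * bias_factor(
    beta["X"], beta["TX"], beta["XY"], beta["TXY"])
# Under nondifferentiality (beta_XY = beta_TXY = 0) the factor is 1.
nondifferential = bias_factor(beta["X"], beta["TX"], 0.0, 0.0)
```

The two routes to the marginal odds ratio should agree to rounding
error for any coefficient values, not only the ones used here.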

3.4 Transparent Reparameterization 

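The full model involves 8 parameters for the 4 observations, and no
component of $\boldsymbol{\beta}$ is identified without some
constraint. We can, however, reparameterize the saturated model into an
identified parameter $\mathbf{E}_{XY}$ and the nonidentified
$\boldsymbol{\pi}_T$ or (equivalently) $\boldsymbol{\beta}_T$:

$$E_{txy} = E_{+xy}\pi_{txy} = E_{+xy}\,\frac{\exp(\beta_T t + \beta_{TX} tx + \beta_{TY} ty + \beta_{TXY} txy)}{1 + \exp(\beta_T + \beta_{TX} x + \beta_{TY} y + \beta_{TXY} xy)}. \tag{3}$$

In the general notation we have $\mathbf{a} = \mathbf{a}_{XY}$,
$\gamma = \mathbf{E}_{XY}$, $\theta = \boldsymbol{\beta}_T$, and
$G(\mathbf{a}; \gamma, \theta) = \Pr(\mathbf{A}_{XY} = \mathbf{a}_{XY} \mid \mathbf{E}_{XY}, \boldsymbol{\beta}_T) = \Pr(\mathbf{A}_{XY} = \mathbf{a}_{XY} \mid \mathbf{E}_{XY})$.
The likelihood depends solely on the identified parameter
$\mathbf{E}_{XY}$ (i.e., for any constant $\mathbf{c}$,
$\mathbf{E}_{XY} = \mathbf{c}$ defines a level set of the likelihood
surface). It follows that there is no updating of
$p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY})$, that is,
$p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY}, \mathbf{a}_{XY}) = p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY})$,
hence,
$p(\boldsymbol{\beta}_T, \mathbf{E}_{XY} \mid \mathbf{a}_{XY}) = p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY})\,p(\mathbf{E}_{XY} \mid \mathbf{a}_{XY})$.

Because $E_{t+y} = E_{+1y}\pi_{t1y} + E_{+0y}\pi_{t0y}$, the target
parameter $OR_{TY} = E_{1+1}E_{0+0}/E_{1+0}E_{0+1}$ is a mixture of
identified parameters $E_{+xy}$ and nonidentified $\pi_{txy}$. Thus,
$OR_{TY}$ may be updated both through
$p(\mathbf{E}_{XY} \mid \mathbf{a}_{XY})$ and
$p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY})$; but with
$p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY}) = p(\boldsymbol{\beta}_T)$,
as here, the update will involve only
$p(\mathbf{E}_{XY} \mid \mathbf{a}_{XY})$.

Table 3

Parameters of normal priors for coefficients in the logistic regression
of T on X and Y (prior for $\beta_{TX} + \beta_{TXY}$ induced by priors
for $\beta_{TX}$ and $\beta_{TXY}$)

    Coefficient                   Mean         Variance   95% prior limits for                          n = 2b = 4/variance*
    $\beta_T$                     logit(0.1)   0.16       $\mathrm{expit}(\beta_T)$: 0.05, 0.20         25
    $\beta_{TX}$                  ln(13.5)     0.25       $\exp(\beta_{TX})$: 5, 36                     16
    $\beta_{TY}$                  0            0.50       $\exp(\beta_{TY})$: 1/4, 4                    8
    $\beta_{TXY}$                 0            0.125      $\exp(\beta_{TXY})$: 1/2, 2                   32
    $\beta_{TX} + \beta_{TXY}$    ln(13.5)     0.375      $\exp(\beta_{TX} + \beta_{TXY})$: 4.1, 45     (not used)

*Number of binomial trials needed to make asymptotic variance estimate
of logit(B/n) equal to prior variance when the number of successes B is
b = n/2; used for penalized estimation.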
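
The 95% prior limits and equivalent trial counts in Table 3 follow from
the prior means and variances. The sketch below recomputes them; the
helper names and dictionary layout are choices made here for
illustration, not the author's code.

```python
import math

def expit(u):
    return 1 / (1 + math.exp(-u))

def logit(p):
    return math.log(p / (1 - p))

def prior_limits(mean, variance, transform=math.exp, z=1.96):
    """Transformed 95% limits of a normal prior N(mean, variance)."""
    half_width = z * math.sqrt(variance)
    return transform(mean - half_width), transform(mean + half_width)

# (mean, variance, transform) for each coefficient, as listed in Table 3.
PRIORS = {
    "T": (logit(0.1), 0.16, expit),
    "TX": (math.log(13.5), 0.25, math.exp),
    "TY": (0.0, 0.50, math.exp),
    "TXY": (0.0, 0.125, math.exp),
}

limits = {k: prior_limits(m, v, f) for k, (m, v, f) in PRIORS.items()}
n_trials = {k: 4 / v for k, (m, v, f) in PRIORS.items()}

# Induced prior for beta_TX + beta_TXY: means and variances add under
# independence.
sum_limits = prior_limits(math.log(13.5) + 0.0, 0.25 + 0.125)
```

After rounding, the computed limits should match the table entries,
including the induced limits of roughly 4.1 and 45 for the sum.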



3.5 An Initial Prior Specification 

As with specifications for regression models, no prior distribution
could be claimed "correct." Nonetheless, some specifications are
plausible and others are not in light of background information. For
the nonidentified T-predictive parameter $\boldsymbol{\beta}_T$,
consider first $\Pr(T = 1 \mid X = 0)$, the probability among noncases
that a "test negative" (X = 0) is erroneous. Because of SIDS rarity we
have $\Pr(T = 1 \mid X = 0) \approx \Pr(T = 1 \mid X = 0, Y = 0) = \mathrm{expit}(\beta_T) = \pi_{100}$.
Antibiotic prevalence $\Pr(T = 1)$ in unselected pregnancies was
expected to be well below 50%, hence, we should expect $\pi_{100}$ to
be small but nonzero to reflect false negatives. These considerations
suggest that plausible distributions for $\mathrm{expit}(\beta_T)$
include some placing 95% probability between 0.05 and 0.20.

Next, let $\varphi_{xty} = \Pr(X = x \mid T = t, Y = y)$. Then
$\varphi_{1ty} = \mathrm{expit}(\beta_X + \beta_{TX} t + \beta_{XY} y + \beta_{TXY} ty)$
and the Y-specific receiver-operating characteristic (ROC) odds ratios
(true-positive odds/false-positive odds) are

$$OR_{TX}(y) = (\varphi_{11y}/\varphi_{01y})/(\varphi_{10y}/\varphi_{00y}) = (\pi_{11y}/\pi_{10y})/(\pi_{01y}/\pi_{00y}) = \exp(\beta_{TX} + \beta_{TXY} y).$$

If X is pure noise, a p-coin flip, then
$\varphi_{11y} = \varphi_{10y} = p$, $\beta_{TX} = \beta_{TXY} = 0$ and
$OR_{TX}(y) = 1$. Background literature (e.g., Werler et al., 1989)
suggests X is nowhere near this bad. Plausible values for
$\varphi_{110}$ include 0.6 and 0.8, and for $\varphi_{100}$ include
0.1 and 0.2, hence, plausible values for
$OR_{TX}(0) = \exp(\beta_{TX})$ include (0.6/0.4)/(0.2/0.8) = 6,
(0.6/0.4)/(0.1/0.9) = 13.5, (0.8/0.2)/(0.2/0.8) = 16,
(0.8/0.2)/(0.1/0.9) = 36, suggesting that plausible distributions for
$\exp(\beta_{TX})$ include some with at least 95% probability between 5
and 40.
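
The four ROC odds ratios quoted above come from pairing each plausible
true-positive probability with each plausible false-positive
probability. A minimal sketch of that arithmetic, with a function name
chosen here for illustration:

```python
from itertools import product

def roc_odds_ratio(true_positive_prob, false_positive_prob):
    """True-positive odds divided by false-positive odds."""
    tp_odds = true_positive_prob / (1 - true_positive_prob)
    fp_odds = false_positive_prob / (1 - false_positive_prob)
    return tp_odds / fp_odds

# Plausible values from the text: 0.6 or 0.8 for the true-positive
# probability among noncases, 0.1 or 0.2 for the false-positive one.
roc_values = sorted(
    roc_odds_ratio(tp, fp) for tp, fp in product((0.6, 0.8), (0.1, 0.2))
)
```

The sorted values span 6 to 36, which is what motivates a prior for the
ROC odds ratio concentrated roughly between 5 and 40.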



The greater uncertainty about
$OR_{TX}(1) = \exp(\beta_{TX} + \beta_{TXY})$, the ROC odds ratio among
cases, is captured by $OR_{TX}(1)/OR_{TX}(0) = \exp(\beta_{TXY})$,
which exceeds 1 if cases have more accurate recall on balance than
noncases and is under 1 if vice-versa. A common assumption is that the
misclassification is nondifferential, that is, that X and Y are
independent given T, or equivalently, equal sensitivity and specificity
across Y. Because $\beta_{XY}$ and $\beta_{XY} + \beta_{TXY}$ are the
XY log odds ratios in the T = 0 and T = 1 strata, under
nondifferentiality we have $\beta_{XY} = \beta_{TXY} = 0$, making
$\exp(\beta_{TXY}) = 1$, $R(\boldsymbol{\beta}_X) = 1$, and hence,
$OR_{TY} = \exp(\beta_{TY})$ (Greenland, 2003c).

Self-report X could be affected by the outcome Y, hence,
nondifferentiality is not a justifiable assumption. Nonetheless, it is
plausible that the impact of Y on X is limited. Furthermore, it is
difficult to predict which of cases or noncases would have a higher ROC
odds ratio: Cases may have improved recall of true exposure
($\varphi_{111} > \varphi_{110}$) but also more false exposure recall
($\varphi_{101} > \varphi_{100}$), which have opposing effects on
$\exp(\beta_{TXY})$. In line with these considerations, placing 95%
probability on $\exp(\beta_{TXY})$ between 1/2 and 2 provides a modest
expansion for the distribution of $OR_{TX}(1)$ beyond that of
$OR_{TX}(0)$.

These considerations are also relevant to $\beta_{TY}$. If the
departures from nondifferentiality are limited, the departures of
$\beta_{XY}$ and $\beta_{TXY}$ from zero are small, which in turn
implies that $R(\boldsymbol{\beta}_X)$ is close to 1 and, hence,
$\exp(\beta_{TY})$ is close to $OR_{TY}$ (Greenland, 2003c). These
results suggest using a prior for $\exp(\beta_{TY})$ similar to that
for $OR_{TY}$, for example, a lognormal prior with 95% probability
between 1/4 and 4.

Table 3 presents a set of normal priors that are consistent with the
preceding considerations, along with the implied density for
$\beta_{TX} + \beta_{TXY}$. The corresponding joint prior density is
independent-normal with mean
$\boldsymbol{\mu}_T = (\mu_T, \mu_{TX}, \mu_{TY}, \mu_{TXY})' = (\mathrm{logit}(0.1), \ln(13.5), 0, 0)'$
and covariance matrix with diagonal
$\mathbf{v}_T = (v_T, v_{TX}, v_{TY}, v_{TXY})' = (0.16, 0.25, 0.50, 0.125)'$.
In the general notation with $\theta = \boldsymbol{\beta}_T$ and
$\lambda = (\boldsymbol{\mu}_T, \mathbf{v}_T)$, the joint prior density
$H(\boldsymbol{\beta}_T; \lambda)$ corresponds to the penalty
$-2\ln\{H(\boldsymbol{\beta}_T; \lambda)\} = \sum_i (\beta_i - \mu_i)^2/v_i$,
where i = T, TX, TY, TXY.
As one gauge of prior information, Table 3 shows the number
$n_i = 4/v_i$ of Bernoulli(1/2) trials with B "successes" that would
make the approximate sampling variance
$1/\{n_i(\tfrac{1}{2})(\tfrac{1}{2})\} = 4/n_i$ of
$\mathrm{logit}(B/n_i)$ equal to the prior variance $v_i$. One can
penalize $\beta_i$ with ordinary maximum-likelihood logistic-regression
software by entering a data record with $b_i = 2/v_i$ "successes" out
of $n_i = 4/v_i$ trials, zero for all covariates except i (for which it
is 1) and an offset $-\mu_i$ (Greenland, 2007a). The result is a
binomial likelihood contribution

$$L_i = L(\beta_i; b_i) \propto \mathrm{expit}(\beta_i - \mu_i)^{b_i}\,\mathrm{expit}(-\beta_i + \mu_i)^{b_i} = \exp(\beta_i - \mu_i)^{b_i}/\{1 + \exp(\beta_i - \mu_i)\}^{2b_i},$$

which is close to normal for $n_i > 8$, becoming heavier-tailed for
smaller $n_i$. With
$\mathbf{b}_T = (b_T, b_{TX}, b_{TY}, b_{TXY})' = 2/\mathbf{v}_T$, we
obtain
$H(\boldsymbol{\beta}_T; \lambda) \approx L(\boldsymbol{\beta}_T; \mathbf{b}_T, \boldsymbol{\mu}_T)$.
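
To make the prior-data device concrete, the sketch below builds the
hypothetical record for one coefficient and compares the log of its
binomial likelihood contribution with the log of the normal prior
kernel it is meant to approximate. The function names and the record
layout are assumptions of this sketch, and the step of passing the
record to logistic-regression software is not shown.

```python
import math

def prior_record(mean, variance):
    """Hypothetical data record encoding an approximate N(mean, variance) prior."""
    return {"successes": 2 / variance, "trials": 4 / variance, "offset": -mean}

def log_prior_likelihood(beta, mean, variance):
    """ln L_i = b(beta - mean) - 2b ln{1 + exp(beta - mean)}, with b = 2/variance."""
    b = 2 / variance
    d = beta - mean
    return b * d - 2 * b * math.log1p(math.exp(d))

def log_normal_kernel(beta, mean, variance):
    """Log of the normal prior density up to an additive constant."""
    return -0.5 * (beta - mean) ** 2 / variance

# Prior for beta_T from Table 3, evaluated one prior SD above its mean.
mean_T, var_T = math.log(0.1 / 0.9), 0.16
beta = mean_T + math.sqrt(var_T)
binomial_drop = (log_prior_likelihood(beta, mean_T, var_T)
                 - log_prior_likelihood(mean_T, mean_T, var_T))
normal_drop = log_normal_kernel(beta, mean_T, var_T)
record_TY = prior_record(0.0, 0.50)
```

For this coefficient (n = 25) the two log-scale drops from the mode
differ only slightly, in line with the statement that the binomial
contribution is close to normal when n exceeds 8.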

As another gauge of prior information, $b_i$ may also be interpreted
as the number of cases one would have to observe in each arm of a
randomized trial of a treatment and a rare outcome with allocation
ratio $\exp(-\mu_i)$ to obtain an approximate variance of $v_i$ for the
log odds-ratio estimate (Greenland, 2006). More generally, as mentioned
earlier, one can derive the $b_i$, $n_i$, and offset or allocation
needed to produce a likelihood that is exactly proportional to a
generalized log-F density with $2b_i$ and $2(n_i - b_i)$ degrees of
freedom; this extension allows skewness or heavy tails for the prior
(Greenland, 2003b, 2007b).

3.6 Penalized-Likelihood and 
Posterior-Sampling Analyses 

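Using the loglinear parameterization and prior likelihoods $L_i$, the
penalized likelihood is
$L(\boldsymbol{\beta} \mid \mathbf{a}_{XY}, \lambda) = L(\boldsymbol{\beta}; \mathbf{a}_{XY})H(\boldsymbol{\beta}_T; \lambda) \approx L(\boldsymbol{\beta}; \mathbf{a}_{XY})L(\boldsymbol{\beta}_T; \mathbf{b}_T, \boldsymbol{\mu}_T)$,
in which $L(\boldsymbol{\beta}; \mathbf{a}_{XY})$ derives from
actual-data records with T missing, and
$L(\boldsymbol{\beta}_T; \mathbf{b}_T, \boldsymbol{\mu}_T)$ derives from
hypothetical complete-data records. Analysis can then proceed using
standard likelihood methods for missing data (McLachlan and Krishnan,
1997; Little and Rubin, 2002). Alternatively, using the transparent
parameterization, we obtain independent draws from the exact marginal
posterior $p(OR_{TY} \mid \mathbf{a}_{XY})$ as follows: (1) draw
$\mathbf{E}^*_{XY}$ from $p(\mathbf{E}_{XY} \mid \mathbf{a}_{XY})$;
(2) draw $\boldsymbol{\beta}^*_T$ from
$p(\boldsymbol{\beta}_T \mid \mathbf{E}^*_{XY})$; (3) compute
$\boldsymbol{\pi}^*_T$ from $\boldsymbol{\beta}^*_T$,
$E^*_{t+y} = E^*_{+1y}\pi^*_{t1y} + E^*_{+0y}\pi^*_{t0y}$, and
$OR^*_{TY} = E^*_{1+1}E^*_{0+0}/E^*_{1+0}E^*_{0+1}$. With a
noninformative prior for $\mathbf{E}_{XY}$ and
$p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY}) = p(\boldsymbol{\beta}_T)$,
as here, step (2) reduces to drawing $\boldsymbol{\beta}^*_T$ from
$p(\boldsymbol{\beta}_T)$ and the resulting sampler is approximated by
Monte Carlo Sensitivity Analysis (MCSA) in which bootstrap draws
$\mathbf{a}^*_{XY}$ from $\mathbf{a}_{XY}$ replace $\mathbf{E}^*_{XY}$
(Greenland, 2005a).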
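
The MCSA steps just described can be sketched with NumPy. This is a
minimal illustration under stated assumptions, not the author's
implementation: the bootstrap is taken to be binomial resampling of the
X = 1 count within each Y margin, the priors are the independent
normals of Table 3, and the function names, seed and number of draws
are arbitrary choices.

```python
import numpy as np

TABLE1 = {(1, 1): 173, (1, 0): 134, (0, 1): 602, (0, 0): 663}  # keys are (x, y)
MU = [np.log(0.1 / 0.9), np.log(13.5), 0.0, 0.0]   # prior means (Table 3)
VAR = [0.16, 0.25, 0.50, 0.125]                    # prior variances (Table 3)

def predictive_value(beta, x, y):
    """pi_1xy = expit(beta_T + beta_TX x + beta_TY y + beta_TXY xy), per draw."""
    eta = beta[:, 0] + beta[:, 1] * x + beta[:, 2] * y + beta[:, 3] * x * y
    return 1 / (1 + np.exp(-eta))

def mcsa_or_ty(counts, mu, var, draws=20000, seed=0):
    """Monte Carlo sensitivity analysis draws of the marginal TY odds ratio."""
    rng = np.random.default_rng(seed)
    # Step 2: draw the imputation coefficients from their prior.
    beta = rng.normal(mu, np.sqrt(var), size=(draws, 4))
    m = {}
    for y in (0, 1):
        n_y = counts[(1, y)] + counts[(0, y)]
        # Step 1 (MCSA version): bootstrap the XY counts within the Y margin.
        a1 = rng.binomial(n_y, counts[(1, y)] / n_y, size=draws)
        a0 = n_y - a1
        # Step 3: impute T and collapse over X.
        m[(1, y)] = a1 * predictive_value(beta, 1, y) + a0 * predictive_value(beta, 0, y)
        m[(0, y)] = n_y - m[(1, y)]
    return m[(1, 1)] * m[(0, 0)] / (m[(1, 0)] * m[(0, 1)])

or_draws = mcsa_or_ty(TABLE1, MU, VAR)
median = np.median(or_draws)
lower, upper = np.percentile(or_draws, [2.5, 97.5])
```

Percentiles of `or_draws` give a simulation interval for the target
odds ratio. Up to Monte Carlo error and the details of the bootstrap
scheme assumed here, they should be broadly comparable to the MCSA
results reported in the text.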

Using the partial prior $p(\boldsymbol{\beta}_T)$ in Table 3 with
$A_{txy}$ either Poisson or multinomial conditional on the Y margin
$(a_{++1}, a_{++0})$, the penalized-likelihood estimate for $OR_{TY}$
is 1.19, with Wald 95% limits of 0.41, 3.43. Both exact posterior
sampling and MCSA with 250,000 draws yield a median for $OR_{TY}$ of
1.19 with 2.5th and 97.5th percentiles of 0.37, 3.42, not much
different in practical terms from the 95% prior limits (0.25, 4). The
posterior variance of $OR_{TY}$ is more sensitive to the prior variance
$v_{TY}$ for $\beta_{TY}$ than to the other prior variances. Upon
increasing $v_{TY}$ to make the prior 95% limits for
$\exp(\beta_{TY})$ equal to 0.125, 8, the 2.5th and 97.5th sampling
percentiles for $OR_{TY}$ become 0.20, 6.1, again not much different
from the prior in practical terms. A common response to this variance
sensitivity would be to set a hyperprior on the prior variance
$v_{TY}$; that would, however, obscure both the contextual meaning of
the prior and the extreme sensitivity of the results to $v_{TY}$.

In examining these results, there are several ways to contrast the
contribution of $p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY})$, which
represents uncertainty about $\boldsymbol{\pi}_T$, against the
contribution of $p(\mathbf{E}_{XY} \mid \mathbf{a}_{XY})$, which
represents uncertainty about $\mathbf{E}_{XY}$. One natural way to
gauge the contribution of
$p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY})$ is to contrast posterior
intervals, such as the 95% posterior sampling interval (0.37, 3.42),
against analogous intervals that assume no uncertainty about
$\boldsymbol{\pi}_T$, such as the conventional 95% interval (1.11,
1.83). Another way is to take the ratio of the estimated sampling
variance of $\ln(\widehat{OR}_{XY})$ to the posterior variance of
$\ln(OR_{TY})$, which here is 5.6%. Either way, the results show that
the precision of the conventional frequentist and Bayesian results is
due entirely to the equality constraints on the predictive values
$\pi_{txy}$, that is, $p(\pi_{11y} = 1) = p(\pi_{00y} = 1) = 1$. This
is unsurprising insofar as $OR_{TY}$ is not identified by the expanded
likelihood; hence, the data add little information about $OR_{TY}$
beyond that in the prior.

3.7 Dependent Parameterizations

The general ideas discussed so far apply to arbitrary
parameterizations of the data model $G(\mathbf{a}; \gamma, \theta)$,
including those in which $\theta$ is partially identified through
dependence on $\gamma$. In the misclassification problem an example
occurs when the bias parameter $\theta$ is taken to be
$\boldsymbol{\beta}_X = (\beta_X, \beta_{TX}, \beta_{XY}, \beta_{TXY})'$
rather than
$\boldsymbol{\beta}_T = (\beta_T, \beta_{TX}, \beta_{TY}, \beta_{TXY})'$.
Specification of $\theta = \boldsymbol{\beta}_X$ and its prior follows
naturally when the initial priors are for the true and false positive
probabilities (the $\varphi_{1ty}$), because these probabilities are
functions solely of $\boldsymbol{\beta}_X$:
$\varphi_{1ty} = \mathrm{expit}(\beta_X + \beta_{TX} t + \beta_{XY} y + \beta_{TXY} ty)$.
However, the identified expectations in $\gamma = \mathbf{E}_{XY}$
imply bounds on the $\varphi_{1ty}$; hence, $\mathbf{E}_{XY}$
constrains $\boldsymbol{\beta}_X$ and the data $\mathbf{a}_{XY}$
identify these constraints. In other words, unlike $\mathbf{E}_{XY}$
and $\boldsymbol{\beta}_T$, $\mathbf{E}_{XY}$ and
$\boldsymbol{\beta}_X$ are variation dependent, which can be viewed as
a logical prior dependence.

Such dependence can be handled by general fitting methods (Joseph,
Gyorkos and Coupal, 1995; Gustafson, Le and Saskin, 2001; Gustafson,
2003), but invalidates simplified posterior computations like MCSA that
assume no updating of the bias parameters. And although a proper prior
on $\boldsymbol{\beta}_X$ will identify the target parameter
$OR_{TY}$, it will lead to an improper posterior for the full parameter
$\boldsymbol{\beta}$ if no further prior specification is made (see
Gelfand and Sahu, 1999, for more general results along these lines).
Thus, if a dependent parameterization is preferred (say, for ease of
prior specification), one way to proceed is to penalize as necessary to
ensure identification of the full parameter vector and employ fitting
methods that do not assume prior independencies. As illustrated in
Section 3.5, however, one could instead retain the transparent
parameterization (and the simplifications its use entails) by
incorporating prior information on the $\varphi_{1ty}$ into the prior
specification process for $\boldsymbol{\beta}_T$.

3.8 How Conditional Should the Probability 
Model Be? 

Log-linear analysis of case-control counts was in- 
troduced over 30 years ago (Bishop, Fienberg and 
Holland, 1975) but appears to have been forgotten 
in favor of logistic regression with Y as the outcome, 
which developed in the same era. The history is un- 
surprising: Unlike logistic regression, the usual log- 
linear approach requires categorization of continu- 
ous covariates and is quite limited in the number 
of covariates it can handle. Furthermore, in case-control studies (in
which the Y total is constrained by design), penalization does not
affect the consistency of odds-ratio estimates from logistic regression
with Y as the outcome (Greenland, 2003b, Section 3). Nonetheless, the
log-linear approach provides a model of the joint distribution of
observed and latent variables in a single regression, which can greatly
simplify bias analysis, hence its resurrection here.

Most literature on tabular data adheres to models 
that condition on the total or a margin of the tabu- 
lar data, even when those quantities are not fixed by 
design. This practice creates no issue for odds-ratio 
analyses: One obtains identical likelihood-based in- 
ferences on odds ratios from multinomial or bino- 
mial (conditional) and Poisson (unconditional) sam- 
pling models that include fixed margins as uncon- 
strained log-linear effects (Bishop, Fienberg and Hol- 
land, 1975, Sections 3.5 and 13.4.4). Nonetheless, 
these design effects imply that no prior should be 
placed on parameters that are functions of sample size or sampling
ratios. For example, in a case-control study the log-linear intercept
and disease coefficient, $\beta_0$ and $\beta_Y$ in model (1), are
functions of the design and so should receive no prior. These
considerations are a further reason for adopting the partial-prior
(semi-Bayes) analysis used here.

4. EXTENSIONS AND GENERALIZATIONS 

The above formulation has straightforward exten- 
sions to other biases and more general data models. 
These are sketched briefly here. 

4.1 Validation and Alternative Measurements 

Adding a plausible measurement model shows that the SIDS data offer far
less information about the target $OR_{TY}$ than the conventional
analysis makes it seem. Sharper inference about $OR_{TY}$ requires
sharper information about the TXY distribution. Short of measuring T
directly on everyone in the study, such information might come from an
alternate measurement W of T on a sample from the source population of
the study, along with X and Y. If this measurement is error-free
(W = T) or assumed so, the alternate measurements are called
"validation data" (Carroll et al., 2006) and yield actual-data records
with T present.

Unfortunately, validation data are often unavailable, impractical to
obtain in a timely manner or inadequate in quantity. Then too, they
suffer their own errors and biases. Subjects do not randomly refuse
further study, and alternate measurements have errors ($W \neq T$).
Thus, regardless of the data available, an accurate uncertainty
assessment requires an expanded model to link the observations (X, Y
and the partially observed W) to unobserved variables (T and missing
W). The identification problem is not removed, rather W is added to
certain records, whose informativeness depends entirely on the priors
relating them to the target variables.

Table 4

SIDS data separated into strata with prescription examined in medical
record ("validated," assuming W = T) and remainder: W = 1 if record
shows prescription, W = 0 if not

                                 Y = 1                Y = 0
                             X = 1    X = 0       X = 1    X = 0
    Medical-record data on W:
      W = 1                    29       17          21       16
      W = 0                    22      143          12      168
      Totals                   51      160          33      184
      $\hat{\pi}_{1xy}$      0.569    0.106       0.636    0.087
    No W data:
      W missing               122      442         101      479
    Imputed counts (for W = 1, $\hat{\pi}_{1xy}$ times W-missing count;
    for W = 0, $\hat{\pi}_{0xy}$ times W-missing count):
      W = 1                  73.2     44.2        60.6     47.9
      W = 0                  48.8    397.8        40.4    431.1

In the SIDS example, a pseudo-random sample of medical records was used
to check maternal responses in the subsample (Drews, Kraus and
Greenland, 1990). W is the record indicator for antibiotic
prescription. Table 4 shows the data from Table 1 separated into
W-known (alternate or complete-data, with W = 1 or 0) and W-unknown
(incomplete-data) strata. If W = T, the resulting data provide a
likelihood for the T|XY parameter vector $\boldsymbol{\pi}_T$ or
$\boldsymbol{\beta}_T$. Because the TXY model is saturated, maximum
likelihood simplifies to using the MLEs $\hat{\pi}_{txy}$ from Table 4
in place of the $\pi_{txy}$ in Table 2 to impute T where it is missing,
followed by collapsing over X (Lyles, 2002). The resulting marginal TY
odds ratio $\widehat{OR}_{TY}$ is the MLE of $OR_{TY}$. For unsaturated
models, $\widehat{OR}_{TY}$ has no closed form but may be replaced by
any reasonably efficient closed-form estimator (Greenland, 2007c);
otherwise a full likelihood method may be used (Espeland and Hui, 1987;
Little and Rubin, 2002; Carroll et al., 2006).
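
The closed-form calculation for the saturated model can be sketched as
follows, using the validated and unvalidated counts of Table 4 and
assuming W = T. The data layout and function name are illustrative
choices made here.

```python
# (x, y): (validated count with W = 1, validated count with W = 0)
VALIDATED = {
    (1, 1): (29, 22), (0, 1): (17, 143),
    (1, 0): (21, 12), (0, 0): (16, 168),
}
# (x, y): count with W missing
UNVALIDATED = {(1, 1): 122, (0, 1): 442, (1, 0): 101, (0, 0): 479}

def validation_or_ty(validated, unvalidated):
    """Impute T in unvalidated records from validated ones, then collapse over X."""
    m = {(t, y): 0.0 for t in (0, 1) for y in (0, 1)}
    for (x, y), (w1, w0) in validated.items():
        pi1_hat = w1 / (w1 + w0)  # MLE of Pr(T=1 | X=x, Y=y), assuming W = T
        missing = unvalidated[(x, y)]
        m[(1, y)] += w1 + pi1_hat * missing
        m[(0, y)] += w0 + (1 - pi1_hat) * missing
    return m[(1, 1)] * m[(0, 0)] / (m[(1, 0)] * m[(0, 1)])

or_ty_hat = validation_or_ty(VALIDATED, UNVALIDATED)
```

Only the point estimate is computed here; the Wald limits quoted in the
text require a variance estimate that this sketch does not attempt.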

Assuming W = T, $\widehat{OR}_{TY} = 1.21$ with Wald 95% confidence
limits 0.79, 1.87. Adding the partial prior in Table 3, an approximate
posterior median for $OR_{TY}$ is 1.20 with 95% Wald limits of 0.81,
1.77, close to the results without the prior. This result shows that
the prior is considerably less informative than the record data when we
assume W = T. But, as with X = T, the constraint W = T is unjustified:
First, the records only show prescription, not compliance (hence, we
should expect for some women T < W); second, the records must have some
errors due to miscellaneous oversights (e.g., miscoding).

Even if we assume that oversights are negligible, the effect of W is a
prescribing (intention-to-treat) effect, and thus (due to
noncompliance) is likely biased for the biologic effect of T. As
before, if we lack T for samples of cases and controls, identification
of $OR_{TY}$ depends entirely on priors for
$\pi_{1wxy} = \Pr(T = 1 \mid W = w, X = x, Y = y)$. Frequentist
analyses arise from sets of equality constraints (point priors) that
identify the parameter of interest. W = T is sufficient by itself but
implausible, so less strict constraints such as
$\mathrm{cov}(W, X \mid T) = 0$ may be introduced along with other
conditions as needed for identification (Hui and Walter, 1980; Carroll
et al., 2006; Messer and Natarajan, 2008).

Additional data sources or variables may also provide partial
identification (Johnson, Gastwirth and Pearson, 2001; Small and
Rosenbaum, 2009). But without such information, results depend on the
$\pi_{1wxy}$ priors in an unlimited fashion: With noninformative priors
for the $\pi_{1wxy}$, we obtain a noninformative posterior for
$OR_{TY}$. If these priors are only vaguely informative, as those
above, the posterior distribution for $OR_{TY}$ will be very dispersed.

I omit an extended (TWXY) analysis because it would merely illustrate
again how posterior concentration is purchased by using extremely
informative priors even when alternate measurements are made. Such
priors may sometimes be plausible. Nonetheless, in many situations the
true exposure history can never be known without considerable potential
for systematic error (e.g., lifetime occupational exposures,
environmental exposures and nutrient intakes). In these situations,
equality constraints need to be recognized as priors, because failure
to do so risks overconfident inferences.

These remarks should not be taken as discour- 
aging collection of additional predictors of the true 
exposure T, since such data provide an empirical ba- 
sis for addressing measurement issues. The present 
discussion merely cautions against overlooking the 
nonidentified elements in any model for their use. 

4.2 Unmeasured Confounders 

Consider a setting in which X rather than T is the exposure variable of
interest and T is an unmeasured confounder of the effect of X on Y. The
target effect is now that of X on Y; nonetheless, the regression models
used for misclassification can be applied unchanged. This effect may be
parameterized by the pair of T-conditional X-Y odds ratios
$\exp(\beta_{XY})$ and $\exp(\beta_{XY} + \beta_{TXY})$. It is usually
assumed that these odds ratios are equal ($\beta_{TXY} = 0$), leaving
$\exp(\beta_{XY})$ as the target; although this equality is another
unjustified point prior, it may incur only minor bias in estimating
summary effects (Greenland and Maldonado, 1994). With the assumption,
the T-adjusted odds ratio $\exp(\beta_{XY})$ is related to the
unadjusted odds ratio $OR_{XY}$ by

$$R(\boldsymbol{\beta}_T) = OR_{XY}/\exp(\beta_{XY}) = \frac{\{1 + \exp(\beta_T + \beta_{TX} + \beta_{TY})\}\{1 + \exp(\beta_T)\}}{\{1 + \exp(\beta_T + \beta_{TX})\}\{1 + \exp(\beta_T + \beta_{TY})\}}$$

(Yanagawa, 1984). Without data on T, $\boldsymbol{\beta}_T$ and hence
$R(\boldsymbol{\beta}_T)$ are not identified. Thus, assuming
$p(\boldsymbol{\beta}_T \mid \mathbf{E}_{XY}) = p(\boldsymbol{\beta}_T) = H(\boldsymbol{\beta}_T; \lambda)$,
to draw $\exp(\beta^*_{XY})$ from
$p\{\exp(\beta_{XY}) \mid \mathbf{a}_{XY}\}$, we draw
$\mathbf{E}^*_{XY}$ from $p(\mathbf{E}_{XY} \mid \mathbf{a}_{XY})$,
compute $OR^*_{XY} = E^*_{+11}E^*_{+00}/E^*_{+10}E^*_{+01}$, draw
$\boldsymbol{\beta}^*_T$ from $H(\boldsymbol{\beta}_T; \lambda)$, and
compute $\exp(\beta^*_{XY}) = OR^*_{XY}/R(\boldsymbol{\beta}^*_T)$.
Assuming
$p(\boldsymbol{\beta}_T, \mathbf{E}_{XY}) = p(\boldsymbol{\beta}_T)$,
MCSA uses bootstrap draws $\mathbf{a}^*_{XY}$ from $\mathbf{a}_{XY}$ in
place of $\mathbf{E}^*_{XY}$ (Greenland, 2003a).
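
A small sketch of the confounding factor may help fix ideas. The
function name and the coefficient values are hypothetical and chosen
only for illustration; they are not priors proposed in the text.

```python
import math

def confounding_factor(bt, btx, bty):
    """R(beta_T) = OR_XY / exp(beta_XY) when beta_TXY = 0."""
    num = (1 + math.exp(bt + btx + bty)) * (1 + math.exp(bt))
    den = (1 + math.exp(bt + btx)) * (1 + math.exp(bt + bty))
    return num / den

# No confounding if T is unassociated with X or with Y (given the other).
no_tx = confounding_factor(-2.0, 0.0, 1.0)
no_ty = confounding_factor(-2.0, 1.0, 0.0)

# Hypothetical confounder positively associated with both X and Y.
factor = confounding_factor(-2.0, 1.0, 1.0)
# External adjustment of the unadjusted Table 1 odds ratio by that factor.
adjusted_or = (173 * 663 / (134 * 602)) / factor
```

In an MCSA run, `confounding_factor` would be evaluated at each prior
draw of the three coefficients and divided into a bootstrap draw of the
unadjusted odds ratio, as described above.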

In two-stage (two-phase) studies, T is measured 
on subsamples of subjects randomly selected within 
X-Y levels (White, 1982; Walker, 1982). This design 
is formally identical to validation subsampling; the 
resulting complete records may be entered into the 
analysis as described earlier. 



4.3 Selection Bias 

Consider again a setting in which T is the exposure variable of
interest. Let X = 1 - S where S is the selection indicator, so only
subjects with X = 0 are observed. The models and target parameter
$OR_{TY}$ used for misclassification are unchanged, but now observed
records are complete (include T, X, Y) and they are confined to the
X = 0 stratum: The observations are
$\mathbf{a}_0 = (a_{101}, a_{100}, a_{001}, a_{000})'$.

With no data at X = 1, the log-linear parameterization is transparent
with identified component
$\gamma = (\beta_0, \beta_T, \beta_Y, \beta_{TY})'$ and nonidentified
X|TY component
$\theta = \boldsymbol{\beta}_X = (\beta_X, \beta_{TX}, \beta_{XY}, \beta_{TXY})'$.
The TY odds-ratio parameter in the X = 0 stratum is
$\exp(\beta_{TY}) = E_{101}E_{000}/E_{100}E_{001}$, and is related to
the target by $OR_{TY} = \exp(\beta_{TY})R(\boldsymbol{\beta}_X)$.
Thus, assuming
$p(\boldsymbol{\beta}_X \mid \mathbf{E}_0) = p(\boldsymbol{\beta}_X) = H(\boldsymbol{\beta}_X; \lambda)$,
to draw $OR^*_{TY}$ from $p\{OR_{TY} \mid \mathbf{a}_0\}$, we draw
$\mathbf{E}^*_0$ from $p(\mathbf{E}_0 \mid \mathbf{a}_0)$, compute
$\exp(\beta^*_{TY}) = E^*_{101}E^*_{000}/E^*_{100}E^*_{001}$, draw
$\boldsymbol{\beta}^*_X$ from $H(\boldsymbol{\beta}_X; \lambda)$, and
compute
$OR^*_{TY} = \exp(\beta^*_{TY})R(\boldsymbol{\beta}^*_X)$. Assuming
$p(\boldsymbol{\beta}_X, \mathbf{E}_0) = p(\boldsymbol{\beta}_X)$, MCSA
uses bootstrap draws $\mathbf{a}^*_0$ from $\mathbf{a}_0$ in place of
$\mathbf{E}^*_0$. If, however, selection is modeled as a Poisson
process with TY-dependent sampling rate
$\exp(\beta_S + \beta_{ST} t + \beta_{SY} y + \beta_{STY} ty)$, as in
"density" (risk-set) sampling, $R(\boldsymbol{\beta}_X)$ simplifies to
$\exp(-\beta_{STY})$, hence, one need only specify and sample from
$p(\beta_{STY})$ (Greenland, 2003a).
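
The Poisson-sampling simplification can be checked numerically: if
each TY cell is sampled at a log-linear rate, the odds ratio among
selected subjects equals the population odds ratio times the
exponentiated interaction coefficient. The population counts and rate
coefficients below are hypothetical test values, and the function names
are assumptions of this sketch.

```python
import math

def odds_ratio(cells):
    """TY odds ratio from a dict keyed by (t, y)."""
    return cells[(1, 1)] * cells[(0, 0)] / (cells[(1, 0)] * cells[(0, 1)])

def select(pop, b_s, b_st, b_sy, b_sty):
    """Expected selected counts under a TY-dependent log-linear sampling rate."""
    return {
        (t, y): n * math.exp(b_s + b_st * t + b_sy * y + b_sty * t * y)
        for (t, y), n in pop.items()
    }

# Hypothetical source-population expected counts, keyed by (t, y).
population = {(1, 1): 40.0, (1, 0): 900.0, (0, 1): 160.0, (0, 0): 8900.0}
b_sty = 0.4  # hypothetical selection interaction
selected = select(population, b_s=-3.0, b_st=0.2, b_sy=1.5, b_sty=b_sty)

target_or = odds_ratio(population)
# Correcting the selected-sample odds ratio by exp(-beta_STY).
corrected_or = odds_ratio(selected) * math.exp(-b_sty)
```

Only the interaction coefficient matters for the correction; the main
effects of the sampling rate cancel from the odds ratio.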

Occasionally, information on nonrespondents (sub- 
jects with X = 1) becomes available. Such informa- 
tion may arise from general records or from call-back 
surveys of nonrespondents. Nonetheless, respondents 
in call-back surveys are unlikely to be a random sam- 
ple of all the original nonrespondents, hence, further 
parameters will be needed to relate survey exclusion 
to T, X and Y. 

4.4 Multiple Biases and Multiple Variables 

The above approach supplements the observed variables Z (representing
available measurements) with wholly latent variables T (representing
unobserved target variables and unmeasured confounders). It then
formulates an identified observable model $P(\mathbf{z} \mid \gamma)$,
a selection-rate model $S(\mathbf{t}, \mathbf{z}; \boldsymbol{\beta}_S)$,
an imputation model
$P(\mathbf{t} \mid \mathbf{z}; \boldsymbol{\beta}_T)$ for T and a
plausible prior $H(\theta; \lambda)$ for the nonidentified
$\theta = (\boldsymbol{\beta}_S, \boldsymbol{\beta}_T)$. Inference to
population quantities involving T can then be based on
$p(\mathbf{t}, \mathbf{z}) \propto P(\mathbf{t} \mid \mathbf{z}; \boldsymbol{\beta}_T)P(\mathbf{z} \mid \gamma)/S(\mathbf{t}, \mathbf{z}; \boldsymbol{\beta}_S)$.
For discrete data, we may replace $P(\mathbf{z} \mid \gamma)$ with
$E_{\mathbf{z}}(\gamma)$, the expected data count at Z = z. With
$p(\gamma, \theta) = p(\gamma)p(\theta)$, posterior sampling reduces to
sampling from $p(\gamma \mid \mathbf{z}) \times H(\theta; \lambda)$;
with an improper prior
$p(\gamma, \theta) = p(\theta) = H(\theta; \lambda)$, we may replace
$P(\mathbf{z} \mid \gamma)$ by its bootstrap estimate. Addition of
identifying data on T is handled in the more usual Bayesian framework.

The general approach models the joint distribu- 
tion of all variables in the problem, including sev- 
eral wholly latent variables; thus, the number of pa- 
rameters can become huge. Effective degrees of free- 
dom can be reduced via hierarchical modeling of the 
parameters (Greenland, 2003a, 2005a); for exam- 
ple, Greenland and Kheifets (2006) analyzed 60 ob- 
served counts with hierarchical models that included 
135 first-stage (data-level) bias parameters gener- 
ated from second-stage linear models. The profusion 
of parameters reflects a reality of observational re- 
search hidden by conventional analyses, which im- 
plicitly set most parameters to zero. Nonetheless, 
uncertainty can often be addressed adequately by 
rather simple analyses of one or two biases; in the 
example, those analyses quickly reveal that the data 
cannot sustain any accurate inference about the tar- 
get parameter given uncertainties about the bias 
sources. 

4.5 Semi-parametric Modeling 

Semi-parametric methods have been extended to incorporate nonidentified
confounding and selection biases when these biases reduce to simple
multiplicative or additive forms (e.g., Brumback et al., 2004; Robins,
Rotnitzky and Scharfstein, 2000; Scharfstein, Rotnitzky and Robins,
1999). Adjustment factors in these extensions correspond to model-based
factors such as $R(\boldsymbol{\beta}_X)$, but lack the finer
parametric structure of the latter. As illustrated above, the
parameters within these factors can serve in prior specification from
background information. This role is important insofar as prior
specification is the hardest task in bias modeling, especially because
noninformative and other reference priors are not serious options for
nonidentified parameters.

Note that semi-parametric robustness is achieved by only partially
specifying the distribution of observables, and thus does not extend to
specification of nonidentified model components. Nonetheless, the
approach illustrated here can be used to extend semi-parametric models
by penalizing the partial- or pseudo-loglikelihood, or by subtracting
half the penalty gradient from the estimating function (or
equivalently, adding the gradient of the log partial prior to that
function).



5. DISCUSSION 

The present paper has addressed settings in which target models or
parameters are not identified, and hence, the data cannot tell us
whether we are close to or far from the target, even probabilistically.
There are two sound responses by the analyst. One is to focus on
describing the study and the data, resisting pressures to make
inferences, in recognition that a single observational study will
provide a basis for action only in extraordinary circumstances
(Greenland, Gago-Dominguez and Castelao, 2004). If instead inference is
mandated, as in pooled analyses to advise policy, we must admit we can
only propose models that incorporate or are at least consistent with
facts as we know them, and that all inferences are completely dependent
on these modeling choices (including nonparametric or semi-parametric
inferences).

In the latter process, we must recognize that there 
will always be an infinite number of such models 
and they will not all yield similar inferences. In this 
sense, statistical modeling provides only inferential 
possibilities rather than inferences. Any analysis 
should thus be viewed a part of a sensitivity analy- 
sis which depends on external plausibility considera- 
tions to reach conclusions (Greenland, 2005b; 
Vansteelandt et al., 2006). Results from single mod- 
els are merely examples of what might be plausi- 
bly inferred, although just one plausible inference 
may suffice to demonstrate inherent limitations of 
the data. 
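
The point that different plausible models can yield different inferences from the same data can be sketched with a simple fixed-value sensitivity analysis for exposure misclassification. The Python example below is only a sketch under stated assumptions: the case-control counts are hypothetical, the sensitivity (se) and specificity (sp) values are arbitrary grid points rather than values from the study discussed in this paper, misclassification is assumed nondifferential, and the function names are invented for illustration. It uses the standard back-correction of an observed exposed count, A = (A* - (1 - sp) N) / (se + sp - 1).

```python
def corrected_exposed(observed_exposed, total, se, sp):
    """Back-correct an observed exposed count given assumed sensitivity
    and specificity of exposure classification (requires se + sp > 1)."""
    return (observed_exposed - (1.0 - sp) * total) / (se + sp - 1.0)


def corrected_odds_ratio(a_obs, n_cases, b_obs, n_controls, se, sp):
    """Odds ratio from back-corrected counts, using the same se and sp for
    cases and controls. Returns None if a corrected count is impossible."""
    a = corrected_exposed(a_obs, n_cases, se, sp)
    b = corrected_exposed(b_obs, n_controls, se, sp)
    if not (0 < a < n_cases and 0 < b < n_controls):
        return None
    return (a * (n_controls - b)) / ((n_cases - a) * b)


# Hypothetical observed data: exposed counts and group totals.
a_obs, n_cases = 120, 500
b_obs, n_controls = 100, 500

# Each (se, sp) pair is one bias model; the data cannot choose among them.
results = {}
for se in (0.7, 0.8, 0.9, 1.0):
    for sp in (0.90, 0.95, 1.0):
        results[(se, sp)] = corrected_odds_ratio(
            a_obs, n_cases, b_obs, n_controls, se, sp
        )

for (se, sp), odds_ratio in sorted(results.items()):
    print(f"se={se:.2f} sp={sp:.2f} corrected OR={odds_ratio:.3f}")
```

Every row of the output is compatible with the same observed table, so the choice among them rests entirely on external plausibility. Replacing the fixed grid with a prior or penalty on the bias parameters is the step that the relaxation approach of this paper adds.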

Vansteelandt et al. (2006) offer a rationale for 
their region-constraint approach beyond those men- 
tioned above (Section 2.1): To keep ignorance about 
θ (uncertainty about bias), expressed as the region
R, distinct from imprecision (statistical or random
error) as sources of uncertainty about τ. As shown
in the example, the same distinction can be made
when using relaxation penalties, and the two sources
of uncertainty can be compared. Nonetheless, in
observational health and social science there is no
objective basis for the data model G(a; γ, θ) (no known
randomizer, random sampler or physical law), which
undermines the physical distinction between ignorance
and imprecision. In these settings, G(a; γ, θ)
merely expresses our conditional (residual) ignorance
about where the data would fall even if we were
given (γ, θ); it differs from H(θ; λ) only in that
G(a; γ, θ) is invariably a conventional (intersubjective)
form representing constraints that would have
been enforced by an experimental design, but in
reality were not enforced.

Whatever their value for summarization, conven- 
tional models do not satisfy plausibility considera- 
tions because they incorporate point constraints on 
unknown parameters. These include many bias pa- 
rameters that can be forced to their null by success- 
ful design strategies, but are probably not null in 
most observational settings. Likewise, interval con- 
straints rarely satisfy all plausibility considerations 
and thus may not be suitable for assessing total un- 
certainty (as opposed to providing sensitivity-analysis 
summaries). In contrast, relaxation penalties and 
priors allow expansion of conventional models and 
point constraints into the plausible realm, and thus 
can provide more plausible inferences. These capa- 
bilities justify their addition to basic statistical train- 
ing for observational sciences. Progress beyond such 
penalties can be made only by obtaining data from a 
design that eliminates or at least partially identifies 
at least one previously nonidentified bias parameter 
(Rosenbaum, 1999; Greenland, 2005a). 